👸 BELA¶

BELA (BLIP ELAN Language Annotation) is a pathway for creating and analysing multilingual transcripts using the BELA convention and the ELAN software.

Getting started¶

BELA is available on PyPI and can be installed using pip:

pip install bela

Sample code¶

The following code snippet reads a BELA transcript and prints out all participants and their utterances & chunks.

import bela

# Read a BELA transcript from an ELAN (.eaf) file
b2 = bela.read_eaf("my_bela_filename.eaf")

# Print every participant, their utterances, and the chunks in each utterance
for person in b2.persons:
    print(person.name, person.code)
    for u in person.utterances:
        print(u, u.from_ts, u.to_ts, u.duration)
        if u.translation:
            print(u.translation)
        for c in u.chunks:
            print(f"  - {c} [{c.language}]")

BELA Tutorials¶

To be updated.

For the BELA API reference, please visit the BELA API reference page.

BELA API reference¶

For most users, bela.read_eaf() is the first function to look at. It returns a bela.Bela2 object for manipulating a BELA transcript directly:

>>> import bela
>>> b2 = bela.read_eaf("my_bela_filename.eaf")

You can now use the b2 object to process the BELA data.

>>> for person in b2.persons:
...     print(person.name, person.code)
...     for u in person.utterances:
...         print(u, u.from_ts, u.to_ts, u.duration)
...         if u.translation:
...             print(u.translation)
...         for c in u.chunks:
...             print(f"  - {c} [{c.language}]")

The bela module¶

bela.read_eaf(eaf_path, **kwargs)¶

Read an EAF file as a Bela2 object

Parameters

eaf_path (str-like object or a Path object) – Path to the EAF file

Returns

A Bela2 object

Return type

bela.Bela2

bela.from_elan(elan, eaf_path=':memory:', **kwargs)¶

Create a BELA-con version 2.x object from a speach.elan.ELANDoc object
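
For example (a minimal sketch, assuming the transcript has already been opened with speach.elan.read_eaf()):

>>> from speach import elan
>>> doc = elan.read_eaf("my_bela_filename.eaf")
>>> b2 = bela.from_elan(doc)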

The lex module¶

This module provides lexicon analysis functions (i.e. counting tokens, calculating class-token ratio, et cetera). New users should start with bela.lex.CorpusLexicalAnalyser.

>>> from bela.lex import CorpusLexicalAnalyser
>>> analyser = CorpusLexicalAnalyser()
>>> source = "my_bela_filename.eaf"  # a label identifying the source transcript
>>> for person in b2.persons:
...     for u in person.utterances:
...         analyser.add(u.text, u.language, source=source, speaker=person.code)
>>> analyser.analyse()
class bela.lex.CorpusLexicalAnalyser(filepath=':memory:', lang_lex_map=None, word_only=False, lemmatizer=True, **kwargs)[source]¶

Analyse a corpus text

analyse(external_tokenizer=True)[source]¶

Analyse all available profiles (i.e. speakers)

read(**kwargs)[source]¶

Read the CSV file content specified by self.filepath

to_dict()[source]¶

Export analysed result as a JSON-ready object
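
For example (a sketch; the exact structure of the exported object is not shown here), the result can be serialised with the standard json module after analyse() has been called:

>>> import json
>>> print(json.dumps(analyser.to_dict(), indent=2))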

BELA-con version 2.0 API¶

The official BELA convention. This version should be used for all new transcripts by default.

class bela.Bela2(elan, path=':memory:', allow_empty=False, nlp_tokenizer=False, word_only=True, ellipsis=True, validate_baby_languages=False, ansi_languages=('English', 'Vocal Sounds', 'Malay', 'Red Dot', ':v:airstream', ':v:crying', ':v:vocalizations'), auto_tokenize=True, split_punc=True, remove_punc=True, **kwargs)[source]¶

BELA-convention version 2
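
A hedged sketch, assuming that keyword arguments given to bela.read_eaf() are forwarded to the Bela2 constructor (as its **kwargs suggests), so that constructor options such as allow_empty can be set when reading a file:

>>> b2 = bela.read_eaf("my_bela_filename.eaf", allow_empty=True)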

find_turns(threshold=1500)[source]¶

Find potential turn-takings

Parameters

threshold (float) – Delay between utterances in milliseconds

Returns

A list of utterance pairs as 2-tuples (from utterance, to utterance)
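
A minimal usage sketch, printing only the documented timestamp attributes of each pair:

>>> for u_from, u_to in b2.find_turns(threshold=1500):
...     print(u_from.to_ts, "->", u_to.from_ts)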

static from_elan(elan, eaf_path=':memory:', **kwargs)[source]¶

Create a BELA-con version 2.x object from a speach.elan.ELANDoc object

parse_name(tier)[source]¶

(Internal) Parse participant name and tier type from a tier object and then update the tier object

This function is internal and should not be used outside of this class.

Parameters

tier (speach.elan.ELANTier) – The tier object to parse

static read_eaf(eaf_path, **kwargs)[source]¶

Read an EAF file as a Bela2 object

Parameters

eaf_path (str-like object or a Path object) – Path to the EAF file

Returns

A Bela2 object

Return type

bela.Bela2

to_language_mix(to_ts=None, auto_compute=True)[source]¶

Collapse utterances to generate a language mix timeline
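
A minimal call sketch with the default arguments (the structure of the returned timeline object is not described here):

>>> mix = b2.to_language_mix()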

tokenize()[source]¶

Tokenize all utterances

property annotation¶

Get an annotation object by ID

property participant_codes¶

Immutable list of participant codes

property person_map¶

Map participant (i.e. person code) to person object
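
For example, participant_codes and person_map can be combined to look up a Person object (a sketch based on the properties documented here):

>>> code = b2.participant_codes[0]
>>> person = b2.person_map[code]
>>> print(person.name, person.code)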

property persons¶

All Person objects in this BELA object

property roots¶

Direct access to all underlying ELAN root tiers

BELA-con version 1.0 API¶

Bela1 has been deprecated since March 2020. It remains available for backward compatibility only. Please do not use it for anything other than BLIP’s PILOT10 corpus.

class bela.Bela1[source]¶

This class represents BELA convention version 1

static read(filepath, autotag=True)[source]¶

Read an ELAN CSV file
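
A minimal sketch (the CSV file name here is hypothetical):

>>> b1 = bela.Bela1.read("my_bela_export.csv")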

to_language_mix(to_ts=None, auto_compute=True)[source]¶

Collapse utterances to generate a language mix timeline

BELA Changelog¶

BELA 2.0.0a22 [WIP]¶

  • Added tokenize() function to utterances and chunks

  • Added the first working prototype of BELA builder (2022-03-29)

  • Added Bela2.save() function

  • Kickstarted BELA documentation

  • Added BELA documentation: https://bela.readthedocs.io/

  • Added ANSI & baby language checking rules

  • Use speach >= 0.1a15.post1

  • Exposed read_eaf() and from_elan() to module level

  • Exposed media_file, media_url, relative_media_url properties

  • Fixed None utterances & chunks for malformed transcripts

BELA 2.0.0a21¶

  • Use speach > 0.1a14 to support Python 3.10 and 3.11

  • Updated annotation mapping mechanism

  • Warn users if OMW-1.4 dataset is missing

  • Clean up ~ characters after plus-to-space expansion

BELA 2.0.0a19¶
