Design overview

This section covers general information about SpiCE, including how the corpus was collected and how the transcriptions were developed. Much of this is also covered in the open-access paper describing the corpus.

A phonological corpus

What's that? A phonological corpus for spoken language has audio recordings, linguistic annotations at the level of the word and phone, and metadata. A phonological corpus should also be representative of the selected population, large enough for its intended use, and collected for a purpose. A good source of more general information on the topic is The Oxford Handbook of Corpus Phonology.

SpiCE is a phonological corpus, albeit one with force-aligned phones. A future release might include hand-corrected phones, but that's an open question at the moment. If you're interested in contributing to the corpus (hand-correction or anything else), please get in touch!

The speech community

The Cantonese-speaking community in Metro Vancouver is a unique bilingual community. Cantonese is not only very widely spoken in the area; it has also been spoken there for a long time, and by a heterogeneous group of people. Statistics Canada has some useful visualizations for getting a broad picture of the linguistic landscape, in particular "Proportion of mother tongue responses for various regions in Canada" from the 2016 Census. Needless to say, there is a lot more that could be said here!

Participant recruitment

Thirty-four early Cantonese-English bilinguals were recruited from the UBC community in Metro Vancouver, BC, Canada, using language like:

Do you speak Cantonese and English? You can be part of a bilingual speech database. We’re studying what makes bilingual speech unique—specifically, how languages influence one another. We are looking for fluent speakers of Cantonese and English, between the ages of 19 and 35 (inclusive), with normal speech and hearing. The study involves two conversational interviews on everyday topics like culture, hobbies, school, community issues, or work. What’s the catch? One interview will be conducted in Cantonese, and the other in English. Both interviews will be recorded and included in an Open Access database, so researchers and developers around the world can play with and learn from your speech. Participation lasts approximately 1.5 hours at UBC (6368 Stores Road, Vancouver, BC). You will be compensated $15 (or receive partial course credit instead, upon request).

Participants were recruited from October 2018 through March 2020 via word of mouth, social media, clubs, the UBC linguistics subject pool, and other similar methods. A detailed summary of the participants' language background information is provided in the corpus download. The group of participants recruited for SpiCE reflects the heterogeneity in the speech community.