Unicode at Gigabytes per Second [r-libre/2439]

Lemire, Daniel (2021). Unicode at Gigabytes per Second. In Lecroq, Thierry and Touzet, Hélène (eds.), SPIRE 2021: String Processing and Information Retrieval. https://doi.org/10.1007/978-3-030-86692-1_2

File(s) associated with this document:

PDF - spire2021.pdf
File content: Submitted manuscript (before peer review)
License: Creative Commons CC BY.

Document category:	Papers in conference/symposium proceedings
Peer reviewed:	No
Publication status:	Published
Abstract:	We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings and even faster on ASCII strings.
Official version URL:	https://link.springer.com/chapter/10.1007/978-3-03...
Depositor:	Lemire, Daniel
Person responsible:	Daniel Lemire
Deposited:	15 Nov. 2021 13:55
Last modified:	20 May 2023 02:02
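
For readers unfamiliar with the operation the abstract refers to, the following is a minimal sketch of what "validate and transcode" from UTF-8 to UTF-16 means. It uses only Python's built-in codecs and is not the library described in the paper, nor does it attempt to match its speed. The function name `utf8_to_utf16le` and the choice of little-endian output without a byte-order mark are assumptions made for illustration.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Validate UTF-8 input and transcode it to UTF-16 (little-endian, no BOM).

    Hypothetical helper for illustration only; it relies on Python's
    built-in codecs, not on the library discussed in the abstract.
    """
    # Strict decoding performs the validation step: malformed input
    # (stray bytes, overlong forms, encoded surrogates) raises
    # UnicodeDecodeError instead of being silently accepted.
    text = data.decode("utf-8", errors="strict")
    # Encoding performs the transcoding step. Code points above U+FFFF
    # are emitted as surrogate pairs (two 16-bit code units).
    return text.encode("utf-16-le")


if __name__ == "__main__":
    sample = "café €".encode("utf-8")
    print(len(sample), len(utf8_to_utf16le(sample)))
    try:
        utf8_to_utf16le(b"\xc0\x80")  # overlong encoding, not valid UTF-8
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

The sketch rejects invalid input rather than replacing it, which corresponds to the "without sacrificing safety" requirement mentioned in the abstract.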
