# NodeJS Language Detection Benchmark :rocket:

- This kind of benchmark is not perfect and percentages can vary over time, but it gives a good idea of overall performance
- Languages evaluated in this benchmark:
  - Asia: `jpn`, `cmn`, `kor`, `hin`
  - Europe: `fra`, `spa`, `por`, `ita`, `nld`, `eng`, `deu`, `fin`, `rus`
  - Middle East: `tur`, `heb`, `ara`
- This page and its graphs are auto-generated from the code

---

## Libraries

Here is the list of libraries in this benchmark:

| Library        | Script                      | Properly Identified | Improperly Identified | Not Identified | Avg Execution Time | Disk Size |
| -------------- | --------------------------- | ------------------- | --------------------- | -------------- | ------------------ | --------- |
| **TinyLD**     | `yarn bench:tinyld`         | 98.274%             | 1.6277%               | 0.0983%        | 0.1002ms           | 930KB     |
| **TinyLD Web** | `yarn bench:tinyld-light`   | 97.8901%            | 1.9765%               | 0.1334%        | 0.0868ms           | **110KB** |
| **langdetect** | `yarn bench:langdetect`     | 95.6623%            | 4.3377%               | 0%             | 0.584ms            | 1.8MB     |
| node-cld       | `yarn bench:cld`            | 92.2917%            | 1.6375%               | 6.0708%        | 0.0687ms           | > 10MB    |
| node-lingua    | `yarn bench:lingua`         | 87.1399%            | 0.069%                | 12.7912%       | 1.1357ms           | ~100MB    |
| franc          | `yarn bench:franc`          | 74.0083%            | 25.9917%              | 0%             | 0.1947ms           | 267KB     |
| franc-min      | `yarn bench:franc-min`      | 70.1002%            | 23.4164%              | 6.4834%        | 0.0914ms           | **119KB** |
| franc-all      | `yarn bench:franc-all`      | 66.3846%            | 33.6154%              | 0%             | 0.576ms            | 509KB     |
| languagedetect | `yarn bench:languagedetect` | 65.9069%            | 11.3828%              | 22.7103%       | 0.2643ms           | **240KB** |
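
For context, here is a minimal sketch of what a single detection call looks like in three of the benchmarked libraries. This is not the benchmark harness itself, and return shapes can vary between versions, so treat the calls and comments as assumptions to check against each library's README.

```ts
// Minimal detection sketch for three of the benchmarked libraries.
// Assumes `npm install tinyld franc cld` and recent versions of each package.
import { detect } from 'tinyld' // returns a short language code, e.g. 'fr'
import { franc } from 'franc'   // returns an ISO 639-3 code, e.g. 'fra'
import cld from 'cld'           // returns a list of candidate languages

const sample = 'Ceci est un petit texte en français.'

console.log(detect(sample)) // tinyld -> 'fr'
console.log(franc(sample))  // franc  -> typically 'fra'

// cld is asynchronous (promise-based)
cld
  .detect(sample)
  .then((result) => console.log(result.languages[0]?.code))
  .catch(() => console.log('not identified'))
```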

---

## Global Accuracy

![Benchmark](./overall.svg)

We see two groups of libraries (separated by `node-lingua` in the middle):

- `tinyld`, `langdetect` and `cld` are above 90% accuracy
- `franc` and `languagedetect` are below 75% accuracy

## Per Language

![Language](./language.svg)

We see big differences between languages:

- **Japanese** and **Korean** are almost at 100% for every library (lots of unique characters)
- **Spanish** and **Portuguese** are really close to each other, which causes more false positives and a higher error rate

## Accuracy by Text Length

Most libraries use statistical analysis, so the longer the input text, the better the detection.
That's why we often see quotes like this in their documentation:

> Make sure to pass it big documents to get reliable results.

Let's see if this statement is true, and how those libraries behave for different input sizes (from small to long).

So the previous quote is right: over 512 characters, all the libraries become accurate enough.

But for a ~95% accuracy threshold:

- `tinyld` (green) reaches it at around 24 characters
- `langdetect` (cyan) and `cld` (orange) reach it at around 48 characters
- `lingua` (red) and `franc` (pink) need more than 256 characters to reach it

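The exact grouping used to produce these numbers lives in the benchmark code; as a rough illustration only, here is a hedged sketch of how accuracy per input length could be computed, assuming a hypothetical list of `[sentence, expected language]` pairs and using `tinyld` as the detector.

```ts
import { detect } from 'tinyld'

// Hypothetical test set: pairs of [sentence, expected language code].
const samples: Array<[string, string]> = [
  ['Bonjour tout le monde', 'fr'],
  ['El rápido zorro marrón salta sobre el perro perezoso', 'es'],
  // ... thousands of sentences in the real benchmark
]

// Group results by input length and compute accuracy per bucket.
const buckets = [8, 16, 24, 32, 48, 64, 128, 256, 512]
const stats = new Map<number, { ok: number; total: number }>()

for (const [text, expected] of samples) {
  const bucket = buckets.find((b) => text.length <= b) ?? 1024
  const entry = stats.get(bucket) ?? { ok: 0, total: 0 }
  entry.total += 1
  if (detect(text) === expected) entry.ok += 1
  stats.set(bucket, entry)
}

for (const [bucket, { ok, total }] of [...stats].sort((a, b) => a[0] - b[0])) {
  console.log(`<= ${bucket} chars: ${((100 * ok) / total).toFixed(2)}% (${total} samples)`)
}
```
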
## Execution Time

Here we can notice a few things about performance:

- `node-lingua` (red) collapses at a scary rate
- `langdetect` (cyan) and `franc` (pink) seem to slow down at a similar rate
- `tinyld` (green) slows down too, but at a much flatter rate
- `cld` (orange) is definitely the fastest and doesn't show any apparent slowdown

But we've seen previously that some of these libraries need more than 256 characters to be accurate.
That means they start to slow down at the same point where they start to give decent results.
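
For reference, an average execution time like the one reported in the table above can be measured with Node's `perf_hooks`; the snippet below is a simplified sketch with a placeholder corpus, not the benchmark's actual harness.

```ts
import { performance } from 'node:perf_hooks'
import { detect } from 'tinyld'

// Placeholder corpus: the real benchmark iterates over the full dataset.
const corpus = ['Hello world, how are you today?', 'Bonjour tout le monde' /* ... */]

// Warm up once so one-time initialization isn't counted in the average.
corpus.forEach((text) => detect(text))

const start = performance.now()
for (const text of corpus) detect(text)
const elapsed = performance.now() - start

console.log(`avg: ${(elapsed / corpus.length).toFixed(4)}ms per detection`)
```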

---

## **Conclusion**

### Recommended :thumbsup:

#### - By platform :computer:

- For **NodeJS**: `TinyLD`, `langdetect` or `node-cld` (fast and accurate)
- For **Browser**: `TinyLD Light` or `franc-min` (small and decently accurate; franc is less accurate but supports more languages)

#### - By usage :speech_balloon:

- Short text (chatbot, keywords, database, ...): `TinyLD` or `langdetect`
- Long text (documents, webpages): `node-cld` or `TinyLD`

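To illustrate the browser recommendation, here is a minimal sketch of using the TinyLD light build, assuming the `tinyld/light` entry point documented by the tinyld package (verify against the version you install).

```ts
// Browser-oriented sketch: the light build trades some accuracy for a much
// smaller bundle (~110KB in the table above).
// Assumes tinyld exposes a `tinyld/light` entry point; check its README.
import { detect, detectAll } from 'tinyld/light'

console.log(detect('これは日本語の文章です')) // e.g. 'ja'
console.log(detectAll('Hola, ¿cómo estás?'))  // e.g. [{ lang: 'es', accuracy: ... }, ...]
```
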
### Not recommended :thumbsdown:

- `node-lingua` has decent accuracy but is just too big and slow
- `franc-all` is the worst in terms of accuracy, which is no surprise because it tries to detect 400+ languages with only 3-grams. It's a technical demo that puts up big numbers but is useless for real usage; even a language like English barely reaches a ~45% detection rate.
- `languagedetect` is light but just not accurate enough

---

## Last word :raising_hand:

Thanks for reading this article. These metrics are really helpful for the development of `tinyld`: they are used during development to see the impact of every modification and feature.

If you want to contribute or would like to see another library added to this benchmark, [open an issue](https://github.com/komodojp/tinyld/issues).