
Commit aef7d96

Update Readme & Benchmark
1 parent 82abac2 commit aef7d96

18 files changed: +477 -134 lines

‎Readme.md

+10 -9

@@ -7,27 +7,28 @@
 
 ![logo](./banner.png)
 
+## :tada: Description
+
 **Tiny** **L**anguage **D**etector, simply detect the language of a unicode UTF-8 text:
 
-- Alternative to libraries like `CLD` or `Franc`
 - Pure JS, No api call, No dependencies (Node and Browser compatible)
 - Blazing fast and low memory footprint (unlike ML methods)
-- Available for both `CommonJS` and `ESM`
+- Train with dataset from [Tatoeba](https://tatoeba.org/en/) and [UDHR](https://unicode.org/udhr/)
 - Support [64 languages](./docs/langs.md) (24 for [the web version](./docs/light.md))
+- Reliable even for really short texts (chatbot, keywords, ...)
 - Support both ISO-639-1 & ISO-639-2
+- Available for both `CommonJS` and `ESM`
 
 ## Links
 
 - [**Playground** - Try the library](https://runkit.com/kefniark/tinyld)
 - [Getting Started](./docs/install.md)
-- [API](./docs/api.md)
-- [CLI](./docs/cli.md)
 - [Supported Languages](./docs/langs.md)
 - [Algorithm](./docs/algorithm.md)
 
 ---
 
-## Getting Started
+## :floppy_disk: Getting Started
 
 ### Install
 
@@ -39,7 +40,7 @@ yarn add tinyld # or npm install --save tinyld
 
 ---
 
-### **TinyLD API**
+### :page_facing_up: **TinyLD API**
 
 ```js
 import { detect, detectAll } from 'tinyld'
@@ -57,7 +58,7 @@ detectAll('ceci est un text en francais.')
 
 ---
 
-### **TinyLD CLI**
+### :paperclip: **TinyLD CLI**
 
 ```bash
 tinyld This is the text that I want to check
@@ -68,7 +69,7 @@ tinyld This is the text that I want to check
 
 ---
 
-## Performance
+## :chart_with_upwards_trend: Performance
 
 Here is a comparison of **Tinyld** against other popular libraries.
 
@@ -78,4 +79,4 @@ To summary in one sentence:
 
 > Better, Faster, Smaller
 
-[More Information](./docs/benchmark.md)
+[More Benchmark Information](./docs/benchmark.md)
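For reference, here is a minimal usage sketch of the `detect` / `detectAll` API referenced in the hunks above. The inputs and outputs are taken from the docs/install.md snippet later in this commit; actual accuracy values will vary with the input text.

```js
// Minimal usage sketch of the API shown in the Readme diff (ESM build).
import { detect, detectAll } from 'tinyld'

// detect() returns the single most likely language code for the text.
console.log(detect('and this is english.')) // 'en'

// detectAll() returns every candidate language with an accuracy score, best first.
console.log(detectAll('ceci est un text en francais.'))
// e.g. [ { lang: 'fr', accuracy: 0.5238 }, { lang: 'ro', accuracy: 0.3802 }, ... ]
```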

‎docs/algorithm.md

+2 -32

@@ -4,6 +4,8 @@ This library uses a variant of the usual N-gram algorithm, which gives fast and
 
 Most libraries are directly using a bayesian scoring algorithm to identify a text language. But TinyLD, decided to add few steps before and after, trying to mimic human logic and identify language with their unique character patterns or word usage.
 
+This is similar to what ML methods use, that's why this library has a training phase too. The goal is to find which "features" or "n-gram" are the more useful for detection without hardcoding any language specific rules. The heavy lifting is done during build time, so at runtime it can be fast and efficient.
+
 ## How it works ?
 
 The string will be split into chunks based on punctuation. Each chunk will be evaluated separately and results merged later weighted with the chunk size.
@@ -73,35 +75,3 @@ For example:
 - English accuracy is good thanks to word detection (JA ~1.5% but EN ~98%)
 
 Which is why together those methods get an overall accuracy > 95%
-
-```
---- Per language Accuracy ---
-- fra - 91.1079%
-- deu - 98.5069%
-- eng - 97.507%
-- rus - 92.241%
-- jpn - 99.96%
-- spa - 88.2282%
-- por - 94.2674%
-- ita - 92.3744%
-- cmn - 97.4537%
-- ara - 98.9468%
-- heb - 100%
-- fin - 92.7743%
-- tur - 96.0139%
-- kor - 99.8933%
-- jav - 74.2857%
-- hin - 95.8406%
-```
-
----
-
-## How the library can be so small ? (~700KB for node, ~90KB for web)
-
-In a normal n-gram algorithm, between languages there are lot of overlap, duplicates or non specific grams (which match more than 15 languages). Gram are just downloaded and check at runtime, which is not really optimized.
-
-Our multi-pass model allow us to have a really compact language profile file.
-
-- **AOT Logic**: the language profiles are pre-compiled ahead of time (avoid duplicates and not relevant gram/words)
-- **Per Language logic**: language really well detected in one method >98% are removed from other methods (with their data)
-- **Pass Exclusion**: gram detected in one pass are automatically ignored by later pass, this allow us to greatly reduce the size of n-gram (or words) to store and check
‎docs/benchmark.md

+86 -28

@@ -1,51 +1,109 @@
-# Benchmark Language Detection Libraries
-
-> Benchmark done on tatoeba dataset (~9M sentences) on 16 of the most common languages.
-
-### **Remark**
+# NodeJS Language Detection Benchmark :rocket:
 
+- This kind of benchmark is not perfect and % can vary over time, but it gives a good idea of overall performances
 - Language evaluated in this benchmark:
   - Asia: `jpn`, `cmn`, `kor`, `hin`
   - Europe: `fra`, `spa`, `por`, `ita`, `nld`, `eng`, `deu`, `fin`, `rus`
   - Middle east: , `tur`, `heb`, `ara`
-- This kind of benchmark is not perfect and % can vary over time, but it gives a good idea of overall performances
+- This page and graphs are auto-generated from the code
 
-## Overall
+---
+
+## Libraries
+
+Here is the list of libraries in this benchmark
 
 | Library | Script | Properly Identified | Improperly identified | Not identified | Avg Execution Time | Disk Size |
 | -------------- | --------------------------- | ------------------- | --------------------- | -------------- | ------------------ | --------- |
-| **TinyLD** | `yarn bench:tinyld` | 97.7311% | 1.9247% | 0.3441% | 0.0966ms. | 930KB |
-| **TinyLD Web** | `yarn bench:tinyld-light` | 97.4512% | 2.1131% | 0.4358% | 0.0802ms. | **110KB** |
-| **langdetect** | `yarn bench:langdetect` | 94.879% | 5.1093% | 0.0117% | 0.4631ms. | 1.8MB |
-| node-cld | `yarn bench:cld` | 88.9148% | 1.7489% | 9.3363% | 0.0477ms. | > 10MB |
-| node-lingua | `yarn bench:lingua` | 86.3093% | 0.13% | 13.5607% | 1.5695ms. | ~100MB |
-| franc | `yarn bench:franc` | 68.7783% | 26.3432% | 4.8785% | 0.1626ms. | 267KB |
-| franc-min | `yarn bench:franc-min` | 65.5163% | 23.5794% | 10.9044% | 0.0734ms. | **119KB** |
-| franc-all | `yarn bench:franc-all` | 61.7893% | 33.3322% | 4.8785% | 0.3688ms. | 509KB |
-| languagedetect | `yarn bench:languagedetect` | 61.6068% | 12.295% | 26.0982% | 0.212ms. | **240KB** |
-
-which gives us the following graph
+| **TinyLD** | `yarn bench:tinyld` | 98.274% | 1.6277% | 0.0983% | 0.1002ms. | 930KB |
+| **TinyLD Web** | `yarn bench:tinyld-light` | 97.8901% | 1.9765% | 0.1334% | 0.0868ms. | **110KB** |
+| **langdetect** | `yarn bench:langdetect` | 95.6623% | 4.3377% | 0% | 0.584ms. | 1.8MB |
+| node-cld | `yarn bench:cld` | 92.2917% | 1.6375% | 6.0708% | 0.0687ms. | > 10MB |
+| node-lingua | `yarn bench:lingua` | 87.1399% | 0.069% | 12.7912% | 1.1357ms. | ~100MB |
+| franc | `yarn bench:franc` | 74.0083% | 25.9917% | 0% | 0.1947ms. | 267KB |
+| franc-min | `yarn bench:franc-min` | 70.1002% | 23.4164% | 6.4834% | 0.0914ms. | **119KB** |
+| franc-all | `yarn bench:franc-all` | 66.3846% | 33.6154% | 0% | 0.576ms. | 509KB |
+| languagedetect | `yarn bench:languagedetect` | 65.9069% | 11.3828% | 22.7103% | 0.2643ms. | **240KB** |
+
+---
+
+## Global Accuracy
+
 ![Benchmark](./overall.svg)
 
+We see two group of libraries (separated by `node-lingua` in the middle)
+
+- `tinyld`, `langdetect` and `cld` over 90% accuracy
+- `franc` and `languagedetect` under 75% accuracy
+
 ## Per Language
 
-Let's now compare those libraries per language
 ![Language](./language.svg)
 
-### **Conclusion**
+We see big differences between languages:
+
+- **Japanese** or **Korean** are almost at 100% for every libs (lot of unique characters)
+- **Spanish** and **Portuguese** are really close and cause more false-positive and an higher error-rate
+
+## Accuracy By Text length
+
+Most libraries are using statistical analysis, so longer is the input text, better will be the detection.
+So we can often see quotes like this in those library documentations.
+
+> Make sure to pass it big documents to get reliable results.
+
+Let's see if this statement is true, and how those libraries behave for different input size (from small to long)
+![Size](./length.svg)
 
-#### Popular != Performant
+So the previous quote is right, over 512 characters all the libs become accurate enough.
 
-- It's quite funny to notice the two most popular nodejs libs `languagedetect` (really high in google) and `franc` (over 4k stars on github) are both at the bottom of the charts.
-- At the opposite `langdetect` which is quite unknown performs really well and is the only one to come close to `tinyld`.
+But for a ~95% accuracy threshold:
 
-#### Recommended
+- `tinyld` (green) reaches it around 24 characters
+- `langdetect` (cyan) and `cld` (orange) reach it around 48 characters
+- `lingua` (red) and `franc` (pink) need more than 256 characters to reach it
+
+## Execution Time
+
+![Size](./exec_time.svg)
+
+Here we can notice few things about performance:
+
+- `node-lingua` (red) collapse at a scary rate
+- `langdetect` (cyan) and `franc` (pink) seems to slow down at a similar rate
+- `tinyld` (green) slow down but at a really flat rate
+- `cld` (orange) is definitely the fastest and doesn't show any apparent slow down
+
+But we've seen previously that some of those libraries need more than 256 characters to be accurate.
+It means they start to slow down at the same time they start to give decent results.
+
+---
+
+## **Conclusion**
+
+### Recommended :thumbsup:
+
+#### - By platform :computer:
 
 - For **NodeJS**: `TinyLD`, `langdetect` or `node-cld` (fast and accurate)
 - For **Browser**: `TinyLD Light` or `franc-min` (small, decent accuracy, franc is less accurate but support more languages)
 
-#### Not recommended
+#### - By usage :speech_balloon:
+
+- Short text (chatbot, keywords, database, ...): `TinyLD` or `langdetect`
+- Long text (documents, webpage): `node-cld` or `TinyLD`
+
+### Not recommended :thumbsdown:
+
+- `node-lingua` has decent accuracy but is just too big and slow
+- `franc-all` is the worst in terms of accuracy, not a surprise because it tries to detect 400+ languages with only 3-grams. A technical demo to put big numbers but useless for real usage, even a language like english barely reaches ~45% detection rate.
+- `languagedetect` is light but just not accurate enough
+
+---
+
+## Last word :raising_hand:
+
+Thanks for reading this article, those metrics are really helpful for the development of `tinyld`.
+It's used in the development to see the impact of every modification and features.
 
-- `node-lingua` has a quite good accuracy but is just too big and slow
-- `franc-all` is the worse in term of accuracy, not a surprise because it tries to detect 400+ languages with only 3-grams. A technical demo to put big numbers but useless for real usage, even a language like english barely reach ~45% detection rate.
-- `languagedetect` is light but just not accurate enough, really focused on indo-european languages (support kazakh but not chinese, korean or japanese). Interesting fact, it's more accurate than franc on west european languages.
+If you want to contribute or see another library in this benchmark, [open an issue](https://github.com/komodojp/tinyld/issues)
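For context, the "Properly / Improperly / Not identified" and "Avg Execution Time" columns above come from the `yarn bench:*` scripts. Below is a hypothetical, simplified sketch of what such a harness measures; helper names like `runBenchmark` are illustrative and not the repository's actual benchmark code.

```js
// Hypothetical, simplified benchmark harness sketch (not the repo's real code):
// for each sample, run detection, then derive the three accuracy columns and
// the average execution time per call.
import { detect } from 'tinyld'

function runBenchmark(samples) {
  // samples: [{ text, expected }] where `expected` uses the same code format detect() returns
  let correct = 0
  let wrong = 0
  let unidentified = 0
  const start = process.hrtime.bigint()
  for (const { text, expected } of samples) {
    const lang = detect(text)
    if (!lang) unidentified++
    else if (lang === expected) correct++
    else wrong++
  }
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6
  return {
    properlyIdentified: (100 * correct) / samples.length,
    improperlyIdentified: (100 * wrong) / samples.length,
    notIdentified: (100 * unidentified) / samples.length,
    avgExecutionTimeMs: elapsedMs / samples.length,
  }
}
```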

‎docs/dev.md

+3

@@ -7,6 +7,9 @@ To be able to train the model
 - Download the [Tatoeba sentence export](https://downloads.tatoeba.org/exports/sentences.tar.bz2)
 - Extract in `data/tatoeba.csv`
 
+- Download the [UDHR](https://unicode.org/udhr/assemblies/udhr_txt.zip)
+- Extract in `data/udhr/`
+
 ## Commands
 
 ```sh

‎docs/exec_time.svg

+1 (SVG file; diff not shown)

‎docs/install.md

+4

@@ -25,3 +25,7 @@ detect('and this is english.') // en
 detectAll('ceci est un text en francais.')
 // [ { lang: 'fr', accuracy: 0.5238 }, { lang: 'ro', accuracy: 0.3802 }, ... ]
 ```
+
+---
+
+[More about the API Documentation](./api.md)
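Since the Readme change in this commit advertises both `CommonJS` and `ESM` builds, the same calls should presumably also work from CommonJS. A hedged sketch, assuming the CJS entry point exposes the same functions as the ESM example above:

```js
// Presumed CommonJS equivalent of the ESM example above
// (assumes the package's CJS build exports the same detect/detectAll functions).
const { detect, detectAll } = require('tinyld')

console.log(detect('and this is english.')) // 'en'
console.log(detectAll('ceci est un text en francais.')[0]) // { lang: 'fr', accuracy: ... }
```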

‎docs/language.svg

+1 -1 (SVG file; diff not shown)

‎docs/length.svg

+1 (SVG file; diff not shown)

‎docs/overall.svg

+1 -1 (SVG file; diff not shown)

‎package.json

+4 -4

@@ -42,21 +42,21 @@
     "train:light": "cross-env TINYLD_CONFIG=light ts-node src/train.ts",
     "dev": "ts-node src/dev.ts",
     "bench": "run-s bench:*",
+    "bench:tinyld": "cross-env TINYLD_CONFIG=normal ts-node src/benchmark/tinyld.ts",
+    "bench:tinyld-light": "cross-env TINYLD_CONFIG=light ts-node src/benchmark/tinyld_light.ts",
     "bench:cld": "ts-node src/benchmark/cld.ts",
     "bench:franc": "ts-node src/benchmark/franc.ts",
     "bench:franc-all": "ts-node src/benchmark/franc-all.ts",
     "bench:franc-min": "ts-node src/benchmark/franc-min.ts",
     "bench:langdetect": "ts-node src/benchmark/langdetect.ts",
     "bench:languagedetect": "ts-node src/benchmark/languagedetect.ts",
-    "bench:lingua": "cross-env TINYLD_CONFIG=normal ts-node src/benchmark/lingua.ts",
-    "bench:tinyld": "cross-env TINYLD_CONFIG=normal ts-node src/benchmark/tinyld.ts",
-    "bench:tinyld-light": "cross-env TINYLD_CONFIG=light ts-node src/benchmark/tinyld_light.ts",
+    "bench:lingua": "ts-node src/benchmark/lingua.ts",
     "build": "run-p build:*",
     "build:normal-node": "esbuild src/index.ts --bundle --minify --platform=node --outfile=dist/tinyld.cjs.js",
     "build:normal-web": "esbuild src/index.ts --bundle --minify --platform=browser --format=esm --outfile=dist/tinyld.esm.js",
     "build:light-node": "esbuild src/index_light.ts --bundle --minify --platform=node --outfile=dist/tinyld.light.cjs.js",
     "build:light-web": "esbuild src/index_light.ts --bundle --minify --platform=browser --format=esm --outfile=dist/tinyld.light.esm.js",
-    "build:post": "yarn gen:svg && yarn gen:mkd",
+    "build:post": "yarn gen:svg && yarn gen:mkd && yarn lint",
     "gen:mkd": "node ./utils/mkdown.js",
     "gen:svg": "node ./utils/index.js",
     "lint": "eslint --ext .js,.ts --fix ./ && prettier --config .prettierrc --ignore-path .prettierignore --write \"**/*.{ts,js,md}\"",
