
Commit aef7d96

Update Readme & Benchmark
1 parent 82abac2 commit aef7d96

18 files changed: +477 -134 lines

‎Readme.md

+10 -9

@@ -7,27 +7,28 @@
 
 ![logo](./banner.png)
 
+## :tada: Description
+
 **Tiny** **L**anguage **D**etector, simply detect the language of a unicode UTF-8 text:
 
-- Alternative to libraries like `CLD` or `Franc`
 - Pure JS, No api call, No dependencies (Node and Browser compatible)
 - Blazing fast and low memory footprint (unlike ML methods)
-- Available for both `CommonJS` and `ESM`
+- Train with dataset from [Tatoeba](https://tatoeba.org/en/) and [UDHR](https://unicode.org/udhr/)
 - Support [64 languages](./docs/langs.md) (24 for [the web version](./docs/light.md))
+- Reliable even for really short texts (chatbot, keywords, ...)
 - Support both ISO-639-1 & ISO-639-2
+- Available for both `CommonJS` and `ESM`
 
 ## Links
 
 - [**Playground** - Try the library](https://runkit.com/kefniark/tinyld)
 - [Getting Started](./docs/install.md)
-- [API](./docs/api.md)
-- [CLI](./docs/cli.md)
 - [Supported Languages](./docs/langs.md)
 - [Algorithm](./docs/algorithm.md)
 
 ---
 
-## Getting Started
+## :floppy_disk: Getting Started
 
 ### Install
 
@@ -39,7 +40,7 @@ yarn add tinyld # or npm install --save tinyld
 
 ---
 
-### **TinyLD API**
+### :page_facing_up: **TinyLD API**
 
 ```js
 import { detect, detectAll } from 'tinyld'
@@ -57,7 +58,7 @@ detectAll('ceci est un text en francais.')
 
 ---
 
-### **TinyLD CLI**
+### :paperclip: **TinyLD CLI**
 
 ```bash
 tinyld This is the text that I want to check
@@ -68,7 +69,7 @@ tinyld This is the text that I want to check
 
 ---
 
-## Performance
+## :chart_with_upwards_trend: Performance
 
 Here is a comparison of **Tinyld** against other popular libraries.
 
@@ -78,4 +79,4 @@ To summary in one sentence:
 
 > Better, Faster, Smaller
 
-[More Information](./docs/benchmark.md)
+[More Benchmark Information](./docs/benchmark.md)
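For reference, here is a minimal usage sketch of the `detect` / `detectAll` API referenced in the hunks above. The inputs and outputs are taken from the docs/install.md snippet later in this commit; actual accuracy values will vary with the input text.

```js
// Minimal usage sketch of the API shown in the Readme diff (ESM build).
import { detect, detectAll } from 'tinyld'

// detect() returns the single most likely language code for the text.
console.log(detect('and this is english.')) // 'en'

// detectAll() returns every candidate language with an accuracy score, best first.
console.log(detectAll('ceci est un text en francais.'))
// e.g. [ { lang: 'fr', accuracy: 0.5238 }, { lang: 'ro', accuracy: 0.3802 }, ... ]
```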

‎docs/algorithm.md

+2 -32

@@ -4,6 +4,8 @@ This library uses a variant of the usual N-gram algorithm, which gives fast and
 
 Most libraries are directly using a bayesian scoring algorithm to identify a text language. But TinyLD, decided to add few steps before and after, trying to mimic human logic and identify language with their unique character patterns or word usage.
 
+This is similar to what ML methods use, that's why this library has a training phase too. The goal is to find which "features" or "n-gram" are the more useful for detection without hardcoding any language specific rules. The heavy lifting is done during build time, so at runtime it can be fast and efficient.
+
 ## How it works ?
 
 The string will be split into chunks based on punctuation. Each chunk will be evaluated separately and results merged later weighted with the chunk size.
@@ -73,35 +75,3 @@ For example:
 - English accuracy is good thanks to word detection (JA ~1.5% but EN ~98%)
 
 Which is why together those methods get an overall accuracy > 95%
-
-```
---- Per language Accuracy ---
-- fra - 91.1079%
-- deu - 98.5069%
-- eng - 97.507%
-- rus - 92.241%
-- jpn - 99.96%
-- spa - 88.2282%
-- por - 94.2674%
-- ita - 92.3744%
-- cmn - 97.4537%
-- ara - 98.9468%
-- heb - 100%
-- fin - 92.7743%
-- tur - 96.0139%
-- kor - 99.8933%
-- jav - 74.2857%
-- hin - 95.8406%
-```
-
----
-
-## How the library can be so small ? (~700KB for node, ~90KB for web)
-
-In a normal n-gram algorithm, between languages there are lot of overlap, duplicates or non specific grams (which match more than 15 languages). Gram are just downloaded and check at runtime, which is not really optimized.
-
-Our multi-pass model allow us to have a really compact language profile file.
-
-- **AOT Logic**: the language profiles are pre-compiled ahead of time (avoid duplicates and not relevant gram/words)
-- **Per Language logic**: language really well detected in one method >98% are removed from other methods (with their data)
-- **Pass Exclusion**: gram detected in one pass are automatically ignored by later pass, this allow us to greatly reduce the size of n-gram (or words) to store and check
‎docs/benchmark.md

+86 -28

@@ -1,51 +1,109 @@
-# Benchmark Language Detection Libraries
-
-> Benchmark done on tatoeba dataset (~9M sentences) on 16 of the most common languages.
-
-### **Remark**
+# NodeJS Language Detection Benchmark :rocket:
 
+- This kind of benchmark is not perfect and % can vary over time, but it gives a good idea of overall performances
 - Language evaluated in this benchmark:
   - Asia: `jpn`, `cmn`, `kor`, `hin`
   - Europe: `fra`, `spa`, `por`, `ita`, `nld`, `eng`, `deu`, `fin`, `rus`
   - Middle east: , `tur`, `heb`, `ara`
-- This kind of benchmark is not perfect and % can vary over time, but it gives a good idea of overall performances
+- This page and graphs are auto-generated from the code
 
-## Overall
+---
+
+## Libraries
+
+Here is the list of libraries in this benchmark
 
 | Library | Script | Properly Identified | Improperly identified | Not identified | Avg Execution Time | Disk Size |
 | -------------- | --------------------------- | ------------------- | --------------------- | -------------- | ------------------ | --------- |
-| **TinyLD** | `yarn bench:tinyld` | 97.7311% | 1.9247% | 0.3441% | 0.0966ms. | 930KB |
-| **TinyLD Web** | `yarn bench:tinyld-light` | 97.4512% | 2.1131% | 0.4358% | 0.0802ms. | **110KB** |
-| **langdetect** | `yarn bench:langdetect` | 94.879% | 5.1093% | 0.0117% | 0.4631ms. | 1.8MB |
-| node-cld | `yarn bench:cld` | 88.9148% | 1.7489% | 9.3363% | 0.0477ms. | > 10MB |
-| node-lingua | `yarn bench:lingua` | 86.3093% | 0.13% | 13.5607% | 1.5695ms. | ~100MB |
-| franc | `yarn bench:franc` | 68.7783% | 26.3432% | 4.8785% | 0.1626ms. | 267KB |
-| franc-min | `yarn bench:franc-min` | 65.5163% | 23.5794% | 10.9044% | 0.0734ms. | **119KB** |
-| franc-all | `yarn bench:franc-all` | 61.7893% | 33.3322% | 4.8785% | 0.3688ms. | 509KB |
-| languagedetect | `yarn bench:languagedetect` | 61.6068% | 12.295% | 26.0982% | 0.212ms. | **240KB** |
-
-which gives us the following graph
+| **TinyLD** | `yarn bench:tinyld` | 98.274% | 1.6277% | 0.0983% | 0.1002ms. | 930KB |
+| **TinyLD Web** | `yarn bench:tinyld-light` | 97.8901% | 1.9765% | 0.1334% | 0.0868ms. | **110KB** |
+| **langdetect** | `yarn bench:langdetect` | 95.6623% | 4.3377% | 0% | 0.584ms. | 1.8MB |
+| node-cld | `yarn bench:cld` | 92.2917% | 1.6375% | 6.0708% | 0.0687ms. | > 10MB |
+| node-lingua | `yarn bench:lingua` | 87.1399% | 0.069% | 12.7912% | 1.1357ms. | ~100MB |
+| franc | `yarn bench:franc` | 74.0083% | 25.9917% | 0% | 0.1947ms. | 267KB |
+| franc-min | `yarn bench:franc-min` | 70.1002% | 23.4164% | 6.4834% | 0.0914ms. | **119KB** |
+| franc-all | `yarn bench:franc-all` | 66.3846% | 33.6154% | 0% | 0.576ms. | 509KB |
+| languagedetect | `yarn bench:languagedetect` | 65.9069% | 11.3828% | 22.7103% | 0.2643ms. | **240KB** |
+
+---
+
+## Global Accuracy
+
 ![Benchmark](./overall.svg)
 
+We see two group of libraries (separated by `node-lingua` in the middle)
+
+- `tinyld`, `langdetect` and `cld` over 90% accuracy
+- `franc` and `languagedetect` under 75% accuracy
+
 ## Per Language
 
-Let's now compare those libraries per language
 ![Language](./language.svg)
 
-### **Conclusion**
+We see big differences between languages:
+
+- **Japanese** or **Korean** are almost at 100% for every libs (lot of unique characters)
+- **Spanish** and **Portuguese** are really close and cause more false-positive and an higher error-rate
+
+## Accuracy By Text length
+
+Most libraries are using statistical analysis, so longer is the input text, better will be the detection.
+So we can often see quotes like this in those library documentations.
+
+> Make sure to pass it big documents to get reliable results.
+
+Let's see if this statement is true, and how those libraries behave for different input size (from small to long)
+![Size](./length.svg)
 
-#### Popular != Performant
+So the previous quote is right, over 512 characters all the libs become accurate enough.
 
-- It's quite funny to notice the two most popular nodejs libs `languagedetect` (really high in google) and `franc` (over 4k stars on github) are both at the bottom of the charts.
-- At the opposite `langdetect` which is quite unknown performs really well and is the only one to come close to `tinyld`.
+But for a ~95% accuracy threshold:
 
-#### Recommended
+- `tinyld` (green) reaches it around 24 characters
+- `langdetect` (cyan) and `cld` (orange) reach it around 48 characters
+- `lingua` (red) and `franc` (pink) need more than 256 characters to reach it
+
+## Execution Time
+
+![Size](./exec_time.svg)
+
+Here we can notice few things about performance:
+
+- `node-lingua` (red) collapse at a scary rate
+- `langdetect` (cyan) and `franc` (pink) seems to slow down at a similar rate
+- `tinyld` (green) slow down but at a really flat rate
+- `cld` (orange) is definitely the fastest and doesn't show any apparent slow down
+
+But we've seen previously that some of those libraries need more than 256 characters to be accurate.
+It means they start to slow down at the same time they start to give decent results.
+
+---
+
+## **Conclusion**
+
+### Recommended :thumbsup:
+
+#### - By platform :computer:
 
 - For **NodeJS**: `TinyLD`, `langdetect` or `node-cld` (fast and accurate)
 - For **Browser**: `TinyLD Light` or `franc-min` (small, decent accuracy, franc is less accurate but support more languages)
 
-#### Not recommended
+#### - By usage :speech_balloon:
+
+- Short text (chatbot, keywords, database, ...): `TinyLD` or `langdetect`
+- Long text (documents, webpage): `node-cld` or `TinyLD`
+
+### Not recommended :thumbsdown:
+
+- `node-lingua` has decent accuracy but is just too big and slow
+- `franc-all` is the worst in terms of accuracy, not a surprise because it tries to detect 400+ languages with only 3-grams. A technical demo to put big numbers but useless for real usage, even a language like english barely reaches ~45% detection rate.
+- `languagedetect` is light but just not accurate enough
+
+---
+
+## Last word :raising_hand:
+
+Thanks for reading this article, those metrics are really helpful for the development of `tinyld`.
+It's used in the development to see the impact of every modification and features.
 
-- `node-lingua` has a quite good accuracy but is just too big and slow
-- `franc-all` is the worse in term of accuracy, not a surprise because it tries to detect 400+ languages with only 3-grams. A technical demo to put big numbers but useless for real usage, even a language like english barely reach ~45% detection rate.
-- `languagedetect` is light but just not accurate enough, really focused on indo-european languages (support kazakh but not chinese, korean or japanese). Interesting fact, it's more accurate than franc on west european languages.
+If you want to contribute or see another library in this benchmark, [open an issue](https://github.com/komodojp/tinyld/issues)
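For context, the "Properly / Improperly / Not identified" and "Avg Execution Time" columns above come from the `yarn bench:*` scripts. Below is a hypothetical, simplified sketch of what such a harness measures; helper names like `runBenchmark` are illustrative and not the repository's actual benchmark code.

```js
// Hypothetical, simplified benchmark harness sketch (not the repo's real code):
// for each sample, run detection, then derive the three accuracy columns and
// the average execution time per call.
import { detect } from 'tinyld'

function runBenchmark(samples) {
  // samples: [{ text, expected }] where `expected` uses the same code format detect() returns
  let correct = 0
  let wrong = 0
  let unidentified = 0
  const start = process.hrtime.bigint()
  for (const { text, expected } of samples) {
    const lang = detect(text)
    if (!lang) unidentified++
    else if (lang === expected) correct++
    else wrong++
  }
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6
  return {
    properlyIdentified: (100 * correct) / samples.length,
    improperlyIdentified: (100 * wrong) / samples.length,
    notIdentified: (100 * unidentified) / samples.length,
    avgExecutionTimeMs: elapsedMs / samples.length,
  }
}
```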

‎docs/dev.md

+3

@@ -7,6 +7,9 @@ To be able to train the model
 - Download the [Tatoeba sentence export](https://downloads.tatoeba.org/exports/sentences.tar.bz2)
 - Extract in `data/tatoeba.csv`
 
+- Download the [UDHR](https://unicode.org/udhr/assemblies/udhr_txt.zip)
+- Extract in `data/udhr/`
+
 ## Commands
 
 ```sh

‎docs/exec_time.svg

+1 (SVG file; diff not shown)

‎docs/install.md

+4

@@ -25,3 +25,7 @@ detect('and this is english.') // en
 detectAll('ceci est un text en francais.')
 // [ { lang: 'fr', accuracy: 0.5238 }, { lang: 'ro', accuracy: 0.3802 }, ... ]
 ```
+
+---
+
+[More about the API Documentation](./api.md)
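Since the Readme change in this commit advertises both `CommonJS` and `ESM` builds, the same calls should presumably also work from CommonJS. A hedged sketch, assuming the CJS entry point exposes the same functions as the ESM example above:

```js
// Presumed CommonJS equivalent of the ESM example above
// (assumes the package's CJS build exports the same detect/detectAll functions).
const { detect, detectAll } = require('tinyld')

console.log(detect('and this is english.')) // 'en'
console.log(detectAll('ceci est un text en francais.')[0]) // { lang: 'fr', accuracy: ... }
```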

‎docs/language.svg

+1 -1 (SVG file; diff not shown)

‎docs/length.svg

+1 (SVG file; diff not shown)

‎docs/overall.svg

+1 -1 (SVG file; diff not shown)

‎package.json

+4 -4

@@ -42,21 +42,21 @@
     "train:light": "cross-env TINYLD_CONFIG=light ts-node src/train.ts",
     "dev": "ts-node src/dev.ts",
     "bench": "run-s bench:*",
+    "bench:tinyld": "cross-env TINYLD_CONFIG=normal ts-node src/benchmark/tinyld.ts",
+    "bench:tinyld-light": "cross-env TINYLD_CONFIG=light ts-node src/benchmark/tinyld_light.ts",
     "bench:cld": "ts-node src/benchmark/cld.ts",
     "bench:franc": "ts-node src/benchmark/franc.ts",
     "bench:franc-all": "ts-node src/benchmark/franc-all.ts",
     "bench:franc-min": "ts-node src/benchmark/franc-min.ts",
     "bench:langdetect": "ts-node src/benchmark/langdetect.ts",
     "bench:languagedetect": "ts-node src/benchmark/languagedetect.ts",
-    "bench:lingua": "cross-env TINYLD_CONFIG=normal ts-node src/benchmark/lingua.ts",
-    "bench:tinyld": "cross-env TINYLD_CONFIG=normal ts-node src/benchmark/tinyld.ts",
-    "bench:tinyld-light": "cross-env TINYLD_CONFIG=light ts-node src/benchmark/tinyld_light.ts",
+    "bench:lingua": "ts-node src/benchmark/lingua.ts",
     "build": "run-p build:*",
     "build:normal-node": "esbuild src/index.ts --bundle --minify --platform=node --outfile=dist/tinyld.cjs.js",
     "build:normal-web": "esbuild src/index.ts --bundle --minify --platform=browser --format=esm --outfile=dist/tinyld.esm.js",
     "build:light-node": "esbuild src/index_light.ts --bundle --minify --platform=node --outfile=dist/tinyld.light.cjs.js",
     "build:light-web": "esbuild src/index_light.ts --bundle --minify --platform=browser --format=esm --outfile=dist/tinyld.light.esm.js",
-    "build:post": "yarn gen:svg && yarn gen:mkd",
+    "build:post": "yarn gen:svg && yarn gen:mkd && yarn lint",
     "gen:mkd": "node ./utils/mkdown.js",
     "gen:svg": "node ./utils/index.js",
     "lint": "eslint --ext .js,.ts --fix ./ && prettier --config .prettierrc --ignore-path .prettierignore --write \"**/*.{ts,js,md}\"",
