@@ -14,9 +14,10 @@ It's sad to hear, but it's not unusual.
14
14
As we can see [here](https://github.com/komodojp/tinyld/blob/develop/docs/benchmark.md#libraries), **Tinyld** is good but not perfect. Overall, it gets the language wrong about 1~2% of the time.
15
15
16
16
The three things which usually increase the error rate:
17
-
* short inputs, try to make it longer
18
-
* similar language (like spanish and catalan)
19
-
* generic names/brand which may appears in multiple language corpus
17
+
18
+
- short inputs: try to make them longer
19
+
- similar languages (like Spanish and Catalan)
20
+
- generic names or brands which may appear in the corpora of multiple languages
20
21
21
22
---
22
23
@@ -29,9 +30,10 @@ Based on how much space and resource you are ready to spend, we provide differen
29
30
30
31
- **Tinyld**: The general one (~500KB), which detects 64 languages
31
32
- **Tinyld Light**: Mostly for browser usage (~70KB), which detects 24 languages
32
-
-**Tinyld Heavy** : The one for backend usage (few MB) which focus on accuracy only
33
+
- **Tinyld Heavy (Soon)**: The one for backend usage (a few MB), which focuses on accuracy only
33
34
34
35
To select the one you want, simply change your import
36
+
35
37
```ts
36
38
import { detect } from 'tinyld'
37
39
import { detect } from 'tinyld/light'
@@ -40,32 +42,39 @@ import { detect } from 'tinyld/heavy'
40
42
41
43
---
42
44
43
-
## Can tinyld identify short strings?
45
+
## Can Tinyld identify short strings?
44
46
45
47
If by short you mean one or two words with good accuracy, the answer is most likely **No**.
46
48
47
-
The key point here is to understand the algorithms behind language detection.
48
-
* How can you detect a text language, from many possibilities without embedding a dictionary of each language?
49
-
* Even just between 2 or 3 languages, how would you do it? Handcraft regexp for specific languages?
50
-
* How can you scale up this method easily to more languages?
49
+
The key point here is to understand the algorithms behind language detection.
51
50
52
-
There are multiple approaches to solve this problem, but the two main ones are AI and statistics.
53
-
And the general idea is to recognize some patterns or succession of letters that are specific to each language. ([n-gram](https://en.wikipedia.org/wiki/N-gram))
51
+
- How can you detect a language without embedding and checking a whole dictionary for each language?
52
+
- Even just between 2 or 3 languages, how would you do it? Handcraft regexp for specific languages?
53
+
- How can you scale up this method easily to more languages? Even to languages you don't speak or understand?
54
+
55
+
There are multiple approaches to this problem, but the two main ones are AI and statistics. The general idea is to recognize patterns or successions of letters that are specific to each language ([n-grams](https://en.wikipedia.org/wiki/N-gram)).
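To make the n-gram idea concrete, here is a minimal sketch that counts character trigrams and scores a text against a language "signature". The helper names (`countNgrams`, `scoreAgainstProfile`) and the profile format are hypothetical illustrations, not Tinyld's actual implementation or API.

```ts
// Hypothetical helper (not part of Tinyld): count character n-grams in a text.
function countNgrams(text: string, n: number = 3): Map<string, number> {
  const counts = new Map<string, number>()
  // Lowercase and collapse whitespace so "The  cat" and "the cat" produce the same n-grams.
  const normalized = text.toLowerCase().replace(/\s+/g, ' ').trim()
  for (let i = 0; i + n <= normalized.length; i++) {
    const gram = normalized.slice(i, i + n)
    counts.set(gram, (counts.get(gram) || 0) + 1)
  }
  return counts
}

// Hypothetical scoring: sum the occurrences of n-grams that also appear
// in a language profile (here simply a set of n-grams typical for that language).
function scoreAgainstProfile(text: string, profile: Set<string>, n: number = 3): number {
  let score = 0
  countNgrams(text, n).forEach((count, gram) => {
    if (profile.has(gram)) score += count
  })
  return score
}
```

A two-letter input such as `"hi"` yields no trigram at all, which is one way to see why very short inputs give the detector almost nothing to work with.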
54
56
55
57
**Good part**:
56
-
* we don't need to understand a language syntax to detect it
57
-
* can be extended to more language fairly easily
58
-
* the signature of a language can be quite small only few KB
58
+
59
+
- We don't need to understand a language's syntax to be able to detect it
60
+
- We can extend this method to more languages fairly easily
61
+
- The signature of a language can be quite small, only a few KB
59
62
60
63
**Bad part**:
61
-
* it require a certain size of text to get a good detection level and valuable n-grams
62
-
* mixed language content is hard to detect
63
64
64
-
We are always looking for way to improve our process, and you can find some benchmark [related to this](https://github.com/komodojp/tinyld/blob/develop/docs/benchmark.md#accuracy-by-text-length).
65
+
- It requires a certain text size to get a good detection accuracy and valuable n-grams
66
+
- Common short words are usually the best for detection ("the", "or", "do", "this"), which leads to better results on sentences than on single words
67
+
- It requires a clean corpus for training
68
+
- Mixed-language content can be hard to detect
69
+
70
+
We are always trying to improve our process and detection rate; you can find some benchmarks [related to this](https://github.com/komodojp/tinyld/blob/develop/docs/benchmark.md#accuracy-by-text-length).
65
71
But to give some numbers:
66
-
* Tinyld usually pass the ~95% detection accuracy threshold around ~24 characters
67
-
* It fall at ~80% for 12 characters (barely usable)
68
-
* Less than 10 characters it's just random
72
+
73
+
- Tinyld usually passes the ~95% detection accuracy threshold around ~24 characters
74
+
- It falls to ~80% at 12 characters (barely usable)
75
+
- Below 10 characters, it's essentially random
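These numbers can be turned into a simple guard that tells the caller how much to trust a result before acting on it. This is only a sketch: `reliabilityForLength` and its three labels are hypothetical, and the bucket edges (24 and 10 characters) are an assumption taken directly from the figures above, not something Tinyld exposes.

```ts
type Reliability = 'good' | 'weak' | 'unreliable'

// Hypothetical helper (not part of Tinyld): rate how trustworthy a detection
// is likely to be, based only on the input length.
function reliabilityForLength(text: string): Reliability {
  // Array.from counts Unicode code points instead of UTF-16 code units.
  const length = Array.from(text.trim()).length
  if (length >= 24) return 'good' // around the ~95% accuracy threshold
  if (length >= 10) return 'weak' // roughly the ~80% zone, barely usable
  return 'unreliable' // below 10 characters the result is close to random
}
```

A caller could, for example, ignore the detected language when the rating is `'unreliable'` and fall back to a default or to a previously known language.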
76
+
77
+
We recommend using the [TinyLD Playground](https://komodojp.github.io/tinyld/) to run some tests and see how the accuracy increases with text length.
69
78
70
79
---
71
80
@@ -77,4 +86,3 @@ One of the easy ways to workaround the size issue is to keep a context, a user i
77
86
So you can keep a buffer (like the last 256 characters from this user in this channel) and run detection on that rather than on just the last message.
78
87
79
88
This gives stability and more accurate results to the detection.
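A minimal sketch of such a buffer is shown here. The `ContextBuffer` class, its method names, and the `"user:channel"` key format are hypothetical; the detector is passed in as a plain function so the sketch does not depend on any particular import.

```ts
// Any function that maps a text to a language code.
type Detect = (text: string) => string

// Hypothetical helper (not part of Tinyld): keeps the most recent characters
// per key (for example "userId:channelId") and detects on the whole buffer.
class ContextBuffer {
  private buffers = new Map<string, string>()

  // maxChars is assumed to be a positive number.
  constructor(private detect: Detect, private maxChars: number = 256) {}

  // Append a message to the buffer for this key, then detect on the buffer.
  push(key: string, message: string): string {
    const previous = this.buffers.get(key) || ''
    const combined = (previous + ' ' + message).trim()
    // Keep only the last maxChars characters of the combined text.
    const recent = combined.slice(-this.maxChars)
    this.buffers.set(key, recent)
    return this.detect(recent)
  }

  // Current buffered text for a key (empty string if nothing was pushed yet).
  peek(key: string): string {
    return this.buffers.get(key) || ''
  }
}
```

Assuming the `detect` function imported from `tinyld` is used as the detector, usage would look like `new ContextBuffer(detect).push('user42:general', message)`, so that a short message such as "ok" is judged together with what the same user wrote just before it.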