Commit 6e914b1

Update FAQ
1 parent e2c515c commit 6e914b1

1 file changed

+33
-25
lines changed

‎docs/faq.md

+33-25
@@ -1,9 +1,9 @@
11
# Frequently Asked Questions
22

3-
* [Language Detection Error](#my-text-is-detected-in-the-wrong-language)
4-
* [Cand I have a custom version](#can-i-have-a-version-specific-for-my-app-and-my-needs)
5-
* [Short text detection issues](#can-tinyld-identify-short-strings)
6-
* [Live Chat usage](#can-i-use-tinyld-for-an-application-like-a-chat-even-if-texts-are-short)
3+
- [Language Detection Error](#my-text-is-detected-in-the-wrong-language)
4+
- [Can I have a custom version](#can-i-have-a-version-specific-for-my-app-and-my-needs)
5+
- [Short text detection issues](#can-tinyld-identify-short-strings)
6+
- [Live Chat usage](#can-i-use-tinyld-for-an-application-like-a-chat-even-if-texts-are-short)
77

88
---
99

@@ -14,9 +14,10 @@ It's sad to hear, but it's not unusual.
1414
As we can see [here](https://github.com/komodojp/tinyld/blob/develop/docs/benchmark.md#libraries), **Tinyld** is good but not perfect. Overall, it gets the language wrong about 1~2% of the time.
1515

1616
Three things usually increase the error rate:
17-
* short inputs, try to make it longer
18-
* similar language (like spanish and catalan)
19-
* generic names/brand which may appears in multiple language corpus
17+
18+
- short inputs (try to make the text longer)
19+
- similar languages (like Spanish and Catalan)
20+
- generic names or brands, which may appear in the corpus of several languages
2021

2122
---
2223

@@ -29,9 +30,10 @@ Based on how much space and resource you are ready to spend, we provide differen
2930

3031
- **Tinyld** : The general one (~500KB), which detects 64 languages
3132
- **Tinyld Light** : Mostly for browser usage (~70KB), which detects 24 languages
32-
- **Tinyld Heavy** : The one for backend usage (few MB) which focus on accuracy only
33+
- **Tinyld Heavy (Soon)** : The one for backend usage (a few MB), which focuses on accuracy only
3334

3435
To select the one you want, simply change your import
36+
3537
```ts
3638
import { detect } from 'tinyld'
3739
import { detect } from 'tinyld/light'
@@ -40,32 +42,39 @@ import { detect } from 'tinyld/heavy'
4042

4143
---
4244

43-
## Can tinyld identify short strings?
45+
## Can Tinyld identify short strings?
4446

4547
If by short you mean one or two words with good accuracy, the answer is most likely **No**.
4648

47-
The key point here is to understand the algorithms behind language detection.
48-
* How can you detect a text language, from many possibilities without embedding a dictionary of each language?
49-
* Even just between 2 or 3 languages, how would you do it? Handcraft regexp for specific languages?
50-
* How can you scale up this method easily to more languages?
49+
The key point here is to understand the algorithms behind language detection.
5150

52-
There are multiple approaches to solve this problem, but the two main ones are AI and statistics.
53-
And the general idea is to recognize some patterns or succession of letters that are specific to each language. ([n-gram](https://en.wikipedia.org/wiki/N-gram))
51+
- How can you detect a language without embedding and checking a whole dictionary for each language?
52+
- Even just between 2 or 3 languages, how would you do it? Handcrafted regexps for specific languages?
53+
- How can you scale this method up easily to more languages? Even to languages you don't speak or understand?
54+
55+
There are multiple approaches to this problem, but the two main ones are AI and statistics. The general idea is to recognize patterns or sequences of letters that are specific to each language ([n-grams](https://en.wikipedia.org/wiki/N-gram)).
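To make the n-gram idea concrete, here is a minimal sketch of counting character n-grams and scoring a text against a per-language set of known n-grams. It is an illustration only, not Tinyld's actual implementation: the names `charNgrams` and `profileScore`, the padding with spaces, and the scoring rule are all assumptions made for this example.

```ts
// Illustrative only: NOT Tinyld's real algorithm. All names here are hypothetical.

// Count every character n-gram of size `n` in `text`.
function charNgrams(text: string, n: number): Map<string, number> {
  const counts = new Map<string, number>()
  // Lowercase, collapse whitespace, and pad with spaces so word boundaries form n-grams too.
  const normalized = ` ${text.toLowerCase().replace(/\s+/g, ' ').trim()} `
  for (let i = 0; i + n <= normalized.length; i++) {
    const gram = normalized.slice(i, i + n)
    counts.set(gram, (counts.get(gram) ?? 0) + 1)
  }
  return counts
}

// Share of the text's n-grams that appear in a language profile (0 when the text has none).
function profileScore(text: string, profile: Set<string>, n: number): number {
  let total = 0
  let hits = 0
  charNgrams(text, n).forEach((count, gram) => {
    total += count
    if (profile.has(gram)) hits += count
  })
  return total === 0 ? 0 : hits / total
}
```

A very short input produces only a handful of n-grams, so a single unusual one can swing the score, which is why text length matters so much.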
5456

5557
**Good part**:
56-
* we don't need to understand a language syntax to detect it
57-
* can be extended to more language fairly easily
58-
* the signature of a language can be quite small only few KB
58+
59+
- We don't need to understand a language's syntax to be able to detect it
60+
- We can extend this method to more languages fairly easily
61+
- The signature of a language can be quite small, only a few KB
5962

6063
**Bad part**:
61-
* it require a certain size of text to get a good detection level and valuable n-grams
62-
* mixed language content is hard to detect
6364

64-
We are always looking for way to improve our process, and you can find some benchmark [related to this](https://github.com/komodojp/tinyld/blob/develop/docs/benchmark.md#accuracy-by-text-length).
65+
- It requires a certain text size to get good detection accuracy and valuable n-grams
66+
- Common short words are usually the best for detection ("the", "or", "do", "this"), which leads to better results on sentences than on single words
67+
- It requires a clean corpus for training
68+
- Mixed-language content can be hard to detect
69+
70+
We are always trying to improve our process and detection rate; you can find some benchmarks [related to this](https://github.com/komodojp/tinyld/blob/develop/docs/benchmark.md#accuracy-by-text-length).
6571
But to give some numbers:
66-
* Tinyld usually pass the ~95% detection accuracy threshold around ~24 characters
67-
* It fall at ~80% for 12 characters (barely usable)
68-
* Less than 10 characters it's just random
72+
73+
- Tinyld usually passes the ~95% detection accuracy threshold at around 24 characters
74+
- It falls to ~80% for 12 characters (barely usable)
75+
- Below 10 characters, it's essentially random
76+
77+
We recommend using the [TinyLD Playground](https://komodojp.github.io/tinyld/) to run some tests and see how accuracy increases with text length.
6978

7079
---
7180

@@ -77,4 +86,3 @@ One of the easy ways to workaround the size issue is to keep a context, a user i
7786
So you can keep a buffer (like the last 256 characters from this user in this channel) and run detection on it rather than on just the last message.
7887

7988
This makes the detection more stable and more accurate.
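The buffering idea can be sketched as follows. This is a minimal example under stated assumptions: the class name `ChatLanguageContext`, its `push` method, and the `Detector` type are hypothetical, and the detector is passed in as a plain function that takes a string and returns a language code.

```ts
// Hypothetical sketch of a per-user, per-channel buffer; none of these names come from Tinyld.
type Detector = (text: string) => string

class ChatLanguageContext {
  private buffers = new Map<string, string>()

  constructor(private detectFn: Detector, private maxChars = 256) {}

  // Append a message to the user's buffer for this channel and detect on the whole buffer.
  push(userId: string, channelId: string, message: string): string {
    const key = JSON.stringify([userId, channelId])
    const previous = this.buffers.get(key) ?? ''
    // Keep only the most recent `maxChars` characters.
    const combined = `${previous} ${message}`.trim().slice(-this.maxChars)
    this.buffers.set(key, combined)
    return this.detectFn(combined)
  }
}
```

With Tinyld, the detector would presumably be the `detect` function from the import shown earlier, for example `new ChatLanguageContext(detect)`, assuming it accepts a string and returns a language code.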
80-
