|
1 | 1 | # Development
|
2 | 2 |
|
3 |
| -## Setup |
4 |
| - |
5 |
| -To be able to train the model |
6 |
| - |
7 |
| -- Download the [Tatoeba sentence export](https://downloads.tatoeba.org/exports/sentences.tar.bz2) |
8 |
| -- Extract in `data/tatoeba.csv` |
9 |
| - |
10 |
| -- Download the [UDHR](https://unicode.org/udhr/assemblies/udhr_txt.zip) |
11 |
| -- Extract in `data/udhr/` |
12 |
| - |
13 | 3 | ## Commands
|
14 | 4 |
|
15 | 5 | ```sh
|
16 |
| -# install deps |
| 6 | +# Install |
17 | 7 | yarn
|
18 | 8 |
|
19 |
| -# train and generate language profiles |
20 |
| -yarn train |
21 |
| - |
22 |
| -# build the library |
| 9 | +# Build |
23 | 10 | yarn build
|
24 | 11 |
|
25 |
| -# code style linting |
| 12 | +# Test |
| 13 | +yarn test |
| 14 | + |
| 15 | +# Lint / Auto-fix code style problems |
26 | 16 | yarn lint
|
| 17 | +``` |
27 | 18 |
|
28 |
| -# test |
29 |
| -yarn test |
| 19 | +--- |
| 20 | + |
| 21 | +## Install issues |
| 22 | + |
| 23 | +For the moment, the library has a lot of dev dependencies that are used only by the benchmark process. |
| 24 | +Some of those libraries need to compile native code, which can be problematic because it requires a working toolchain (gcc, gyp, python, ...). |
| 25 | + |
| 26 | +If you run into those issues, one of the easiest solutions is to remove the problematic dependencies from `package.json`, then try installing again. |
| 27 | + |
| 28 | +See [this issue comment](https://github.com/komodojp/tinyld/issues/10#issuecomment-1019085476) for an example. |
| 29 | + |
| 30 | +Removing them only breaks `yarn bench`; everything else should still work normally. |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## Optional |
| 35 | + |
| 36 | +### 1. Generate profiles (`yarn train`) |
| 37 | + |
| 38 | +This step requires a lot of data and time, so it is optional, and the results are stored directly in git. |
| 39 | + |
| 40 | +It analyses a lot of text in different languages and builds statistics to identify the best features for each language. |
| 41 | + |
| 42 | +To train the model, you first need to have the datasets locally. |
| 43 | + |
| 44 | +``` |
| 45 | +Download Datasets |
| 46 | + - Download the [Tatoeba sentence export](https://downloads.tatoeba.org/exports/sentences.tar.bz2) |
| 47 | + - Extract in `data/tatoeba.csv` |
| 48 | + - Download the [UDHR](https://unicode.org/udhr/assemblies/udhr_txt.zip) |
| 49 | + - Extract in `data/udhr/` |
| 50 | +
|
| 51 | +Run yarn train |
| 52 | + - For each language, it will build statistics for words and n-grams |
| 53 | + - This goes through a massive amount of data and will take time, so prepare a few coffees |
| 54 | +
|
| 55 | +Once your profile files are generated, run `yarn build` to get a build that uses the new data |
30 | 56 | ```
|
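
The download and extraction steps above could be scripted roughly as follows. This is only a sketch and not part of the project: the function names `run` and `fetch_datasets` and the `DRY_RUN` variable are made up for illustration, and it assumes that `curl`, `tar` and `unzip` are installed and that the Tatoeba archive unpacks to a single `sentences.csv` file.

```sh
# Sketch: fetch the two training datasets into data/.
# Assumptions (not confirmed by the project): curl, tar and unzip are
# available, and the Tatoeba archive contains a file named sentences.csv.

TATOEBA_URL="https://downloads.tatoeba.org/exports/sentences.tar.bz2"
UDHR_URL="https://unicode.org/udhr/assemblies/udhr_txt.zip"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

fetch_datasets() {
  run mkdir -p data/udhr || return 1

  # Tatoeba: download, unpack, then move the CSV to the expected path.
  run curl -fL -o data/sentences.tar.bz2 "$TATOEBA_URL" || return 1
  run tar -xjf data/sentences.tar.bz2 -C data || return 1
  run mv data/sentences.csv data/tatoeba.csv || return 1

  # UDHR: download and unpack the text files.
  run curl -fL -o data/udhr_txt.zip "$UDHR_URL" || return 1
  run unzip -o -q data/udhr_txt.zip -d data/udhr || return 1
}
```

To use it, source the snippet from the repository root and call `fetch_datasets`. Setting `DRY_RUN=1` beforehand prints the commands instead of running them, which is a cheap way to check the paths before starting a large download.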
| 57 | + |
| 58 | +### 2. Generate benchmark data (`yarn bench`) |
| 59 | + |
| 60 | +This step requires a bit of time: it runs a lot of different tests against a set of libraries to generate the benchmark page and diagrams. |