hackersgame committed on
Commit 454be19 · verified · 1 Parent(s): 928e26f

Upload README.md with huggingface_hub

  1. README.md +18 -3
README.md CHANGED
@@ -10,7 +10,8 @@ tags:
  - free-software
  - dfsg
  datasets:
- - Skylion007/openwebtext
+ - wikimedia/wikipedia
+ - pg19
  metrics:
  - accuracy
  model-index:
@@ -38,7 +39,7 @@ pipeline_tag: feature-extraction
 
  # Free Language Embeddings (V34)
 
- 300-dimensional word vectors trained from scratch on ~2B tokens of DFSG-compliant text using a single RTX 3090.
+ 300-dimensional word vectors trained from scratch on ~2B tokens of freely-licensed text using a single RTX 3090.
 
  **66.5% on Google analogies** — beating the original word2vec (61% on 6B tokens) by 5.5 points with 1/3 the data.
 
@@ -49,12 +50,26 @@ pipeline_tag: feature-extraction
  | **Architecture** | Dynamic masking word2vec skip-gram |
  | **Dimensions** | 300 |
  | **Vocabulary** | 100,000 whole words |
- | **Training data** | ~2B tokens (OpenWebText subset, DFSG-compliant) |
+ | **Training data** | ~2B tokens, all [DFSG-compliant](https://wiki.debian.org/DFSGLicenses) (see below) |
  | **Training hardware** | Single NVIDIA RTX 3090 |
  | **Training time** | ~4 days (2M steps) |
  | **License** | GPL-3.0 |
  | **Parameters** | 60M (30M target + 30M context embeddings) |
 
+ ### Training Data
+
+ All training data meets the [Debian Free Software Guidelines](https://wiki.debian.org/DFSGLicenses) for redistribution, modification, and use. No web scrapes, no proprietary datasets.
+
+ | Source | Weight | License |
+ |--------|--------|---------|
+ | Wikipedia | 30% | CC BY-SA 3.0 |
+ | Project Gutenberg | 20% | Public domain |
+ | arXiv | 20% | Various open access |
+ | Stack Exchange | 16% | CC BY-SA 4.0 |
+ | US Government Publishing Office | 10% | Public domain (US gov) |
+ | RFCs | 2.5% | IETF Trust |
+ | Linux kernel docs, Arch Wiki, TLDP, GNU manuals, man pages | 1.5% | GPL/GFDL |
+
  ## Benchmark Results
 
  | Model | Data | Google Analogies |
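
The numbers in this change can be sanity-checked with a few lines of Python. The sketch below is not part of the commit. It copies the source weights, the ~2B token total, the vocabulary size, and the dimension count from the README text, then checks that the weights sum to 100% and that two 100,000 × 300 embedding tables give the stated 60M parameters. The per-source token counts are derived estimates that assume each "Weight" is a share of total tokens; the README does not say whether the weights are token shares or sampling probabilities. The function names are hypothetical.

```python
# Values copied from the README diff; function names are illustrative only.
TOTAL_TOKENS = 2_000_000_000  # README says "~2B tokens" (approximate)
VOCAB_SIZE = 100_000          # "100,000 whole words"
DIMENSIONS = 300              # "300-dimensional word vectors"

# "Weight" column of the Training Data table, in percent.
MIX_PERCENT = {
    "Wikipedia": 30.0,
    "Project Gutenberg": 20.0,
    "arXiv": 20.0,
    "Stack Exchange": 16.0,
    "US Government Publishing Office": 10.0,
    "RFCs": 2.5,
    "Linux kernel docs, Arch Wiki, TLDP, GNU manuals, man pages": 1.5,
}


def approx_tokens_per_source(total_tokens, mix_percent):
    """Estimate tokens per source, ASSUMING weights are shares of total tokens."""
    return {
        name: round(total_tokens * pct / 100)
        for name, pct in mix_percent.items()
    }


def embedding_parameters(vocab_size, dimensions, tables=2):
    """Parameter count for `tables` embedding matrices (target + context)."""
    return tables * vocab_size * dimensions


if __name__ == "__main__":
    print(f"weights sum to {sum(MIX_PERCENT.values())}%")
    for name, tokens in approx_tokens_per_source(TOTAL_TOKENS, MIX_PERCENT).items():
        print(f"{name}: ~{tokens / 1e6:.0f}M tokens")
    print(f"parameters: {embedding_parameters(VOCAB_SIZE, DIMENSIONS):,}")
```

Under the token-share assumption, Wikipedia would contribute roughly 600M tokens and the RFCs roughly 50M. If the weights are sampling probabilities instead, the tokens actually seen per source would also depend on corpus sizes and repetition, which the README does not report.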