AI & ML interests

Ontology + Concordance: The meeting of meaning

Recent Activity

Taishi-N324 authored a paper 15 days ago

On the Optimal Reasoning Length for RL-Trained Language Models

huu-ontocord updated a dataset about 1 month ago

ontocord/person2rel

huu-ontocord published a dataset about 1 month ago

ontocord/person2rel

View all activity

Papers

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

View all Papers

ajibawa-2023

posted an update 4 days ago

Post

2474

PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.

ajibawa-2023

posted an update 8 days ago

Post

3229

JavaScript-Code-Large
ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments. .

ajibawa-2023

posted an update 10 days ago

Post

3121

Java-Code-Large ( ajibawa-2023/Java-Code-Large)

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million java codes. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.