ajibawa-2023 
posted an update about 1 month ago
Go-Code-Large
Dataset: ajibawa-2023/Go-Code-Large

Go-Code-Large is a large-scale corpus of Go (Golang) programming language source code, comprising 316,427 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, cloud-native systems, and modern backend software engineering.

By offering a focused and curated dataset for Go, this corpus enables experimentation in concurrent programming, distributed systems, and performance-oriented backend services—domains where Go is widely adopted.

Go-Code-Large addresses the relative scarcity of large, language-specific datasets for Go, enabling targeted research into idiomatic Go patterns, concurrency primitives, and scalable system design.

Related: https://huggingface.co/datasets/jedisct1/golang

That dataset is a large corpus (1,300,000+) of commit messages and their corresponding code diffs, drawn from the most popular Go packages as well as the Go compiler itself.

Thanks for sharing this!