Sekai: A Video Dataset towards World Exploration

Zhen Li1,2,4, Chuanhao Li1,📧, Xiaofeng Mao1, Shaoheng Lin1, Ming Li1, Shitian Zhao1, Zhaopan Xu1,
Xinyue Li1, Yukang Feng3, Jianwen Sun3, Zizhen Li3, Fanrui Zhang3, Jiaxin Ai3, Zhixiang Wang5,
Yuwei Wu2,4,📧, Tong He1, Jiangmiao Pang1, Yu Qiao1, Yunde Jia4, Kaipeng Zhang1,3,📧
1Shanghai AI Laboratory, 2Beijing Institute of Technology, 3Shanghai Innovation Institute,
4Shenzhen MSU-BIT University, 5The University of Tokyo
We are looking for collaboration and self-motivated interns. Contact: zhangkaipeng@pjlab.org.cn.

Abstract

Video generation techniques have made remarkable progress, promising to serve as the foundation of interactive world exploration. However, existing video generation datasets are not well suited for world exploration training, as they suffer from several limitations: limited locations, short durations, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person-view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking and drone-view (FPV and UAV) videos spanning 750+ cities in over 100 countries and regions. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We further use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the areas of video generation and world exploration, and motivate valuable applications.

Introduction Video

Dataset Overview

[Figure: Sekai dataset overview]

In this paper, we introduce Sekai (せかい, meaning "world" in Japanese), a high-quality egocentric worldwide video dataset for world exploration. Most videos contain audio, supporting immersive world generation, and the dataset also benefits other applications such as video understanding, navigation, and video-audio co-generation. Sekai-Real comprises over 5,000 hours of videos collected from YouTube with high-quality annotations, while Sekai-Game comprises videos from a realistic video game with ground-truth annotations. Sekai has five distinct features:

1. High-quality and diverse video. All videos are recorded at 720p, featuring diverse weather conditions, various times of day, and dynamic scenes.

2. Worldwide location. Videos span 100+ countries and regions, showcasing 750+ cities with diverse cultures, activities, and landscapes.

3. Walking and drone view. Beyond walking videos, Sekai includes drone-view (FPV and UAV) videos for unrestricted world exploration.

4. Long duration. All walking videos are at least 60 seconds long, supporting long-term, real-world exploration.

5. Rich annotations. All videos are annotated with location, scene, weather, crowd density, captions, and camera trajectories. Annotations of YouTube videos are high quality, while annotations from the game are ground truth; a sketch of the annotation record follows below.
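
To make the annotation types above concrete, here is a minimal Python sketch of what a single clip's annotation record might look like. All field names and value formats are illustrative assumptions for this page, not the dataset's actual schema.

from dataclasses import dataclass, field
from typing import List

# Hypothetical annotation record for one Sekai clip; field names and value
# formats are illustrative, not the dataset's actual schema.
@dataclass
class SekaiClipAnnotation:
    clip_id: str                 # unique clip identifier
    location: str                # e.g., "Kyoto, Japan"
    scene: str                   # scene category, e.g., "historic district"
    weather: str                 # e.g., "sunny", "overcast", "rain"
    crowd_density: str           # e.g., "empty", "sparse", "dense"
    caption: str                 # detailed natural-language caption
    camera_trajectory: List[List[float]] = field(default_factory=list)
    # per-frame camera pose, e.g., [x, y, z, qx, qy, qz, qw]

example = SekaiClipAnnotation(
    clip_id="yt_abc123_0001",
    location="Kyoto, Japan",
    scene="historic district",
    weather="overcast",
    crowd_density="sparse",
    caption="A first-person walk along a narrow street lined with wooden shops.",
    camera_trajectory=[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]],
)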

Dataset Curation

[Figure: Sekai dataset curation pipeline]

The Sekai dataset curation pipeline comprises four stages: video collection, pre-processing, annotation, and diverse sampling. The pipeline gathers over 8,600 hours of high-resolution YouTube videos and 40 hours of photorealistic game footage. In pre-processing, videos are segmented into 400,000+ clips, which are then passed through luminance, quality, subtitle, and camera-trajectory filters to ensure clean, high-quality data. Annotation is powered by vision-language models (e.g., Qwen2.5-VL-72B, GPT-4o) and a structure-from-motion model (MegaSaM), covering location, scene categories, detailed captions, and camera trajectories. Finally, a high-quality subset (Sekai-Real-HQ) is sampled using a combination of quality scores and diversity-aware strategies across location, category, content, and trajectory, ensuring broad and balanced coverage for training.
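
As one example of the pre-processing stage, the sketch below shows a luminance filter that drops clips whose frames are, on average, too dark or too bright. The thresholds, frame-sampling scheme, and function name are assumptions for illustration; the paper's actual filter settings may differ.

import numpy as np

def passes_luminance_filter(frames: np.ndarray,
                            low: float = 20.0,
                            high: float = 235.0) -> bool:
    """Return True if a clip's mean luma lies inside [low, high].

    frames: (N, H, W, 3) uint8 RGB frames sampled from the clip.
    Thresholds are illustrative, not the paper's actual settings.
    """
    # ITU-R BT.601 luma approximation, averaged over all sampled frames.
    luma = (0.299 * frames[..., 0]
            + 0.587 * frames[..., 1]
            + 0.114 * frames[..., 2])
    return low <= float(luma.mean()) <= high

Analogous predicates for quality, subtitle, and trajectory filtering can then be chained, keeping only clips that pass every check.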

YUME Model

[Figure: YUME model pipeline]

We train an interactive world exploration model named YUME (ゆめ, meaning "dream" in Japanese) on a subset of Sekai-Real-HQ. YUME takes a single image as input and allows unrestricted exploration of the depicted world through keyboard and mouse control.
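
The sketch below illustrates how such an interaction loop could be wired up: keyboard and mouse input are mapped to a camera action, which conditions the model's next-frame generation. The model object, its generate_next_frame method, and the input-reading callback are hypothetical stand-ins, not YUME's actual API.

from dataclasses import dataclass

@dataclass
class CameraAction:
    forward: float = 0.0   # W/S keys: translation along the view direction
    strafe: float = 0.0    # A/D keys: lateral translation
    yaw: float = 0.0       # mouse x: rotation around the vertical axis
    pitch: float = 0.0     # mouse y: rotation around the lateral axis

def explore(model, start_image, get_user_input, num_steps=100):
    """Autoregressively generate frames from a start image and user input.

    `model.generate_next_frame` is a hypothetical API standing in for an
    image- and action-conditioned video generator such as YUME.
    """
    frame = start_image
    for _ in range(num_steps):
        keys, mouse_dx, mouse_dy = get_user_input()
        action = CameraAction(
            forward=1.0 if "w" in keys else -1.0 if "s" in keys else 0.0,
            strafe=1.0 if "d" in keys else -1.0 if "a" in keys else 0.0,
            yaw=mouse_dx,
            pitch=mouse_dy,
        )
        frame = model.generate_next_frame(frame, action)  # hypothetical call
        yield frame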

BibTeX

@article{li2025sekai,
    title={Sekai: A Video Dataset towards World Exploration}, 
    author={Zhen Li and Chuanhao Li and Xiaofeng Mao and Shaoheng Lin and Ming Li and Shitian Zhao and Zhaopan Xu and Xinyue Li and Yukang Feng and Jianwen Sun and Zizhen Li and Fanrui Zhang and Jiaxin Ai and Zhixiang Wang and Yuwei Wu and Tong He and Jiangmiao Pang and Yu Qiao and Yunde Jia and Kaipeng Zhang},
    journal={arXiv preprint arXiv:2506.15675},
    year={2025}
}