Sekai: A Video Dataset towards World Exploration

Zhen Li¹,²,⁴, Chuanhao Li¹,📧, Xiaofeng Mao¹, Shaoheng Lin¹, Ming Li¹, Shitian Zhao¹, Zhaopan Xu¹,
Xinyue Li¹, Yukang Feng³, Jianwen Sun³, Zizhen Li³, Fanrui Zhang³, Jiaxin Ai³, Zhixiang Wang⁵,
Yuwei Wu²,⁴,📧, Tong He¹, Jiangmiao Pang¹, Yu Qiao¹, Yunde Jia⁴, Kaipeng Zhang¹,³,📧

¹Shanghai AI Laboratory, ²Beijing Institute of Technology, ³Shanghai Innovation Institute,
⁴Shenzhen MSU-BIT University, ⁵The University of Tokyo

We are looking for collaborators and self-motivated interns. Contact: zhangkaipeng@pjlab.org.cn.

Abstract

Video generation techniques have made remarkable progress and promise to become the foundation of interactive world exploration. However, existing video generation datasets are not well suited to training for world exploration because of several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from 750 cities across over 100 countries and regions. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.

Dataset Overview

In this paper, we introduce Sekai (せかい, meaning "world" in Japanese), a high-quality egocentric worldwide video dataset for world exploration. Most videos include audio, which supports immersive world generation. The dataset also benefits other applications, such as video understanding, navigation, and video-audio co-generation. Sekai-Real comprises over 5,000 hours of videos collected from YouTube with high-quality annotations. Sekai-Game comprises videos from a realistic video game, with ground-truth annotations. Sekai has five distinguishing features:

1. High-quality and diverse video. All videos are recorded in 720p, featuring diverse weather, various times, and dynamic scenes.

2. Worldwide location. Videos span over 100 countries and regions, showcasing 750+ cities with diverse cultures, activities, and landscapes.

3. Walking and drone view. Beyond walking videos, Sekai includes drone view (FPV and UAV) videos for unrestricted world exploration.

4. Long duration. All walking videos are at least 60 seconds long, ensuring real-world, long-term world exploration.

5. Rich annotations. All videos are annotated with location, scene, weather, crowd density, captions, and camera trajectories. YouTube videos' annotations are of high quality, while annotations from the game are considered ground truth.
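As an illustration of how the annotation fields listed above might be grouped per clip, here is a minimal sketch. The class name, field names, and value conventions are hypothetical and chosen only for this example; they are not the dataset's actual release format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# Hypothetical schema: every name below is illustrative, not Sekai's real format.
@dataclass
class ClipAnnotation:
    clip_id: str
    source: str         # assumed convention: "real" (YouTube) or "game"
    view: str           # assumed convention: "walking" or "drone"
    duration_s: float
    location: str
    scene: str
    weather: str
    crowd_density: str
    caption: str
    # One (x, y, z) camera position per sampled frame; a full trajectory
    # would normally carry rotation as well.
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)

    def is_ground_truth(self) -> bool:
        """Game annotations are treated as ground truth; YouTube ones are estimated."""
        return self.source == "game"

    def meets_walking_minimum(self, min_seconds: float = 60.0) -> bool:
        """Walking clips should be at least 60 seconds; other views are not constrained here."""
        return self.view != "walking" or self.duration_s >= min_seconds
```

A record like this makes it easy to separate the two sources, for example by keeping only clips where `is_ground_truth()` holds when evaluating an annotation model.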

Dataset Curation

The Sekai dataset curation pipeline comprises four stages: video collection, pre-processing, annotation, and diverse sampling. The collection stage gathers over 8,600 hours of high-resolution YouTube videos and 40 hours of photorealistic game footage. In pre-processing, videos are segmented into 400,000+ clips, which are then filtered by luminance, quality, subtitles, and trajectory to ensure clean, high-quality data. Annotation is powered by LLMs (e.g., Qwen2.5-VL-72B, GPT-4o) and a structure-from-motion model (MegaSaM), covering location, scene categories, detailed captions, and camera trajectories. Finally, a high-quality subset (Sekai-Real-HQ) is sampled using a combination of quality scores and diversity-aware strategies across location, category, content, and trajectory to ensure broad and balanced coverage for training.
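To make the filtering and sampling stages more concrete, here is a minimal sketch of two of the ideas: a luminance check and a quality-aware, location-balanced sampler. This is not the authors' toolbox. The function names, the thresholds (20 and 235), the clip dictionary keys (`location`, `quality`), and the round-robin strategy are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def mean_luminance(rgb_pixels):
    """Average Rec. 709 luma over an iterable of (r, g, b) pixels in [0, 255]."""
    total, count = 0.0, 0
    for r, g, b in rgb_pixels:
        total += 0.2126 * r + 0.7152 * g + 0.0722 * b
        count += 1
    return total / count if count else 0.0


def passes_luminance(rgb_pixels, low=20.0, high=235.0):
    """Reject frames that are too dark or too bright (thresholds are assumed)."""
    return low <= mean_luminance(rgb_pixels) <= high


def sample_balanced(clips, k, key="location", seed=0):
    """Pick up to k clips round-robin across groups, best quality first per group."""
    groups = defaultdict(list)
    for clip in clips:
        groups[clip[key]].append(clip)
    for members in groups.values():
        members.sort(key=lambda c: c["quality"], reverse=True)

    # Shuffle the group order so no location is systematically favoured.
    order = sorted(groups)
    random.Random(seed).shuffle(order)

    picked, rank = [], 0
    while len(picked) < k:
        added = False
        for name in order:
            if rank < len(groups[name]):
                picked.append(groups[name][rank])
                added = True
                if len(picked) == k:
                    break
        if not added:  # every group is exhausted
            break
        rank += 1
    return picked
```

Round-robin selection is only one simple way to trade off quality against coverage: each location contributes its best clip before any location contributes a second one. The same function could be reused with `key="scene"` or another annotation field to balance along a different axis.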

YUME Model

We train an interactive world exploration model named YUME (ゆめ, meaning "dream" in Japanese) using a subset of Sekai-Real-HQ. Specifically, it takes an image as input and lets users explore the scene without restriction using keyboard and mouse control.

BibTeX

@article{li2025sekai,
  title={Sekai: A Video Dataset towards World Exploration},
  author={Zhen Li and Chuanhao Li and Xiaofeng Mao and Shaoheng Lin and Ming Li and Shitian Zhao and Zhaopan Xu and Xinyue Li and Yukang Feng and Jianwen Sun and Zizhen Li and Fanrui Zhang and Jiaxin Ai and Zhixiang Wang and Yuwei Wu and Tong He and Jiangmiao Pang and Yu Qiao and Yunde Jia and Kaipeng Zhang},
  journal={arXiv preprint arXiv:2506.15675},
  year={2025}
}