{"product_id":"9789819228492","title":"Data Engineering for Large Foundation Models A Handbook","description":"\u003ch1\u003eData Engineering for Large Foundation Models\u003c\/h1\u003e\u003ch2\u003eA Handbook\u003c\/h2\u003e\u003ch3\u003eJun Yu | Chang Wen Chen\u003c\/h3\u003e\u003cdiv\u003e\u003cb\u003eComputers \/ Database Administration \u0026amp; Management\u003c\/b\u003e\u003c\/div\u003e\u003cbr\u003e\u003cdiv\u003e\n\u003cp data-olk-copy-source=\"MessageBody\"\u003eData quality has become a decisive foundation for large foundation models, shaping their capability, reliability, alignment, and real-world applicability. \u003cem\u003eData Engineering for Large Foundation Models: A Handbook\u003c\/em\u003e provides a systematic and practice-oriented guide to data engineering for foundation models. Moving beyond a narrow focus on large language models, the book covers the data lifecycle behind language models, vision-language models, multimodal understanding systems, text-to-image and text-to-video generative models, reasoning models, agentic systems, and domain-specific AI applications.\u003c\/p\u003e\r\n\u003cp\u003eThe book presents a full-stack framework for building high-quality data pipelines for foundation-model development. It covers large-scale pre-training data engineering, including data sourcing, acquisition, cleaning, deduplication, decontamination, tokenization, serialization, efficient loading, and quality evaluation. 
It also addresses multimodal data engineering for image-text, document, video, and audio data, as well as post-training and alignment data construction, including SFT, preference data, RLHF, Chain-of-Thought reasoning data, tool-use data, agent memory, and multi-turn interaction data.\u003c\/p\u003e\r\n\u003cp\u003eThe book further examines data-centric AI systems, including synthetic data factories, knowledge distillation, enterprise-grade RAG and multimodal RAG pipelines, online feedback loops, knowledge updating, DataOps platforms, data governance, privacy protection, federated learning, and compliance-aware data engineering. Through end-to-end projects and reproducible system designs, readers gain hands-on experience with distributed pre-training data pipelines, domain-specific SFT datasets, multimodal instruction data factories, reasoning data flywheels, agent tool-use data factories, enterprise DataOps platforms, privacy-preserving pipelines, open-source model reproduction, and text-to-video training data pipelines. Using modern tools such as Ray, Spark, Dask, Parquet, WebDataset, vector databases, DVC, MLflow, and Airflow, this handbook equips data engineers, MLOps and DataOps professionals, AI researchers, and technical product teams to build reliable, scalable, and continuously improving foundation-model systems.\u003c\/p\u003e\n\u003c\/div\u003e\u003cdiv\u003e\n\u003cp\u003eJun Yu is Associate Professor at the Department of Automation, University of Science and Technology of China (USTC). His research expertise in multimedia computing and intelligent robotics directly supports the multimodal data engineering focus of this book. He has authored more than 200 academic papers, including over 100 first or corresponding author publications in leading IEEE\/ACM journals and CCF A conferences. As a frequent Senior Program Committee member for CVPR, ICCV, ICML, NeurIPS, IJCAI, and AAAI, he is deeply engaged with the global AI community. 
Bridging academia and industry, Professor Yu has led 40 major projects and contributed to Huawei’s MindSpore ecosystem, including MindFace and MindOCR. His work has earned the Wu Wenjun AI Science and Technology Award and multiple best paper prizes.\u003c\/p\u003e\r\n\u003cp\u003eChang Wen Chen is Chair Professor of Visual Computing and Interim Dean of the Faculty of Computer and Mathematical Sciences at The Hong Kong Polytechnic University. He previously served as Dean of Science and Engineering at The Chinese University of Hong Kong, Shenzhen, and Deputy Director at Peng Cheng Laboratory. Professor Chen has held editorial leadership roles as Editor-in-Chief of IEEE Transactions on Multimedia and IEEE Transactions on Circuits and Systems for Video Technology. His distinguished career has been recognized with the Alexander von Humboldt Award, the SUNY Chancellor’s Award for Excellence in Scholarship, and the UIUC ECE Distinguished Alumni Award. He is an IEEE Fellow, SPIE Fellow, and member of Academia Europaea. 
With decades of experience in visual computing and multimedia systems, Professor Chen brings authoritative insight into the data engineering challenges addressed in this book.\u003c\/p\u003e\n\u003c\/div\u003e\u003cbr\u003e\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd\u003ePublication Date: \u003c\/td\u003e\n\u003ctd\u003e14 December 2026\u003c\/td\u003e\n\u003c\/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003ePublisher: \u003c\/td\u003e\n\u003ctd\u003eSpringer Nature Singapore\u003c\/td\u003e\n\u003c\/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003eImprint: \u003c\/td\u003e\n\u003ctd\u003eSpringer\u003c\/td\u003e\n\u003c\/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003eISBN-13: \u003c\/td\u003e\n\u003ctd\u003e9789819228492\u003c\/td\u003e\n\u003c\/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003eFormat: \u003c\/td\u003e\n\u003ctd\u003eHardback\u003c\/td\u003e\n\u003c\/tr\u003e\n\u003c\/table\u003e","brand":"Springer Nature Singapore","offers":[{"title":"Default Title","offer_id":50300946219148,"sku":"9789819228492","price":197.99,"currency_code":"USD","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0710\/9545\/1788\/files\/9789819228492.jpg?v=1780616579","url":"https:\/\/fh90cf-fv.myshopify.com\/products\/9789819228492","provider":"Late Knight Books and Services, LLC","version":"1.0","type":"link"}