Advanced Analytics with Spark (Reprint Edition)

1-star price: ¥37.5 (67% of list price)

2-star price: ¥37.5; list price: ¥56.0

Authors: Sandy Ryza (US) et al.

Publisher: Southeast University Press

Category: Computers/Networking > Databases > Data Warehousing and Data Mining

ISBN: 9787564159108
Binding: standard offset paper
Volumes: not available
Weight: not available
Format: 16开 (16mo)
Pages: 260
Publication date: 2015-09-01
Barcode: 9787564159108 ; 978-7-5641-5910-8

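Book Highlights

In this practical book, Advanced Analytics with Spark (reprint edition, in English) by Ryza et al., four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical principles, and real-world data sets together and teach you, through examples, how to solve data analysis problems.

You will start with an introduction to Spark and its ecosystem, then dive into patterns that apply standard techniques, such as classification, collaborative filtering, and anomaly detection, to fields including genomics, security, and finance. If you have a basic understanding of machine learning and statistics and you program in Java, Python, or Scala, you will find these patterns useful for your own data analysis applications.

Patterns include:
Recommending music and the Audioscrobbler data set
Predicting forest cover with decision trees
Detecting anomalies in network traffic with K-means clustering
Understanding Wikipedia with latent semantic analysis
Analyzing co-occurrence networks with GraphX
Analyzing New York City taxi trip data with geospatial and temporal methods
Estimating financial risk through Monte Carlo simulation
Analyzing genomics data and the BDG project
Analyzing neuroimaging data with PySpark and Thunder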

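Synopsis

In an era of rapidly growing volumes of online data, there is an urgent need for tools that can analyze and process data quickly and efficiently, and Spark emerged to meet that need. Written by Spark developers and core members, this book helps readers quickly learn how to use Spark to collect, compute on, simplify, and store massive amounts of data; how to carry out interactive, iterative, and incremental analysis; and how to solve problems such as partitioning, data locality, and custom serialization.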

Foreword
Preface
1. Analyzing Big Data
  The Challenges of Data Science
  Introducing Apache Spark
  About This Book
2. Introduction to Data Analysis with Scala and Spark
  Scala for Data Scientists
  The Spark Programming Model
  Record Linkage
  Getting Started: The Spark Shell and SparkContext
  Bringing Data from the Cluster to the Client
  Shipping Code from the Client to the Cluster
  Structuring Data with Tuples and Case Classes
  Aggregations
  Creating Histograms
  Summary Statistics for Continuous Variables
  Creating Reusable Code for Computing Summary Statistics
  Simple Variable Selection and Scoring
  Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set
  Data Set
  The Alternating Least Squares Recommender Algorithm
  Preparing the Data
  Building a First Model
  Spot Checking Recommendations
  Evaluating Recommendation Quality
  Computing AUC
  Hyperparameter Selection
  Making Recommendations
  Where to Go from Here
4. Predicting Forest Cover with Decision Trees
  Fast Forward to Regression
  Vectors and Features
  Training Examples
  Decision Trees and Forests
  Covtype Data Set
  Preparing the Data
  A First Decision Tree
  Decision Tree Hyperparameters
  Tuning Decision Trees
  Categorical Features Revisited
  Random Decision Forests
  Making Predictions
  Where to Go from Here
5. Anomaly Detection in Network Traffic with K-means Clustering
  Anomaly Detection
  K-means Clustering
  Network Intrusion
  KDD Cup 1999 Data Set
  A First Take on Clustering
  Choosing k
  Visualization in R
  Feature Normalization
  Categorical Variables
  Using Labels with Entropy
  Clustering in Action
  Where to Go from Here
6. Understanding Wikipedia with Latent Semantic Analysis
  The Term-Document Matrix
  Getting the Data
  Parsing and Preparing the Data
  Lemmatization
  Computing the TF-IDFs
  Singular Value Decomposition
  Finding Important Concepts
  Querying and Scoring with the Low-Dimensional Representation
  Term-Term Relevance
  Document-Document Relevance
  Term-Document Relevance
  Multiple-Term Queries
  Where to Go from Here
7. Analyzing Co-occurrence Networks with GraphX
  The MEDLINE Citation Index: A Network Analysis
  Getting the Data
  Parsing XML Documents with Scala's XML Library
  Analyzing the MeSH Major Topics and Their Co-occurrences
  Constructing a Co-occurrence Network with GraphX
  Understanding the Structure of Networks
  Connected Components
  Degree Distribution
  Filtering Out Noisy Edges
  Processing EdgeTriplets
  Analyzing the Filtered Graph
  Small-World Networks
  Cliques and Clustering Coefficients
  Computing Average Path Length with Pregel
  Where to Go from Here
8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  Getting the Data
  Working with Temporal and Geospatial Data in Spark
  Temporal Data with JodaTime and NScalaTime
  Geospatial Data with the Esri Geometry API and Spray
  Exploring the Esri Geometry API
  Intro to GeoJSON
  Preparing the New York City Taxi Trip Data
  Handling Invalid Records at Scale
  Geospatial Analysis
  Sessionization in Spark
  Building Sessions: Secondary Sorts in Spark
  Where to Go from Here
9. Estimating Financial Risk through Monte Carlo Simulation
  Terminology
  Methods for Calculating VaR
  Variance-Covariance
  Historical Simulation
  Monte Carlo Simulation
  Our Model
  Getting the Data
  Preprocessing
  Determining the Factor Weights
  Sampling
  The Multivariate Normal Distribution
  Running the Trials
  Visualizing the Distribution of Returns
  Evaluating Our Results
  Where to Go from Here
10. Analyzing Genomics Data and the BDG Project
  Decoupling Storage from Modeling
  Ingesting Genomics Data with the ADAM CLI
  Parquet Format and Columnar Storage
  Predicting Transcription Factor Binding Sites from ENCODE Data
  Querying Genotypes from the 1000 Genomes Project
  Where to Go from Here
11. Analyzing Neuroimaging Data with PySpark and Thunder
  Overview of PySpark
  PySpark Internals
  Overview and Installation of the Thunder Library
  Loading Data with Thunder
  Thunder Core Data Types
  Categorizing Neuron Types with Thunder
  Where to Go from Here
A. Deeper into Spark
B. Upcoming MLlib Pipelines API
Index
