Big data analysis
Lecturers: Janez Povh, Leon Kos
Syllabus outline:
• Introduction to big data analysis: what big data is, where we find it, and how to store it.
• Visualizations of big data: which tools and diagrams are suitable for representing big data.
• Simple big data analysis: searching for similar items (near-neighbour search, similarity-preserving summaries of sets).
• Network and link analysis: the PageRank algorithm; link spam; hubs and authorities.
• Data streams: the stream data model; sampling data in a stream; filtering streams; sensor data and decision rules based on sensor data.
• Supervised and unsupervised learning from big data: clustering, classification, and regression analysis.
• Hadoop: what the Hadoop Distributed File System (HDFS) is, how the MapReduce framework works, and how data-related jobs are generated and scheduled.
• First steps in R and RHadoop: we will introduce the programming language R and the Hadoop libraries rmr and rhdfs, together with RStudio as a GUI. Students will receive a virtual machine with this software installed.
• Analysis, visualisation and statistical learning from big data using RHadoop
• Testing RHadoop on supercomputers: students will get access to a supercomputer at the University of Ljubljana to run analyses on genuinely large data sets.
Objectives and competences:
The main objective of this course is to make students competent in working with big data using state-of-the-art hardware and software tools.
General competences:
• the use of methodological tools, i.e. the implementation, coordination and organization of research, and the application of various quantitative research methods and techniques
• the use and combination of knowledge from different disciplines
• the ability to use information and communications technologies and data analytic tools in engineering
• the ability to collect, store, analyse and interpret big data
Subject-specific competences:
• knowledge of the specific features of big data
• knowledge of methods adjusted for the analysis of big data
• knowledge of tools for analysing big data
• the ability to use high-performance computers to analyse big data
• mastery of R and Hadoop for big data analysis
Intended learning outcomes:
The student will:
• understand how big data analysis differs from classical data analysis
• master the methods designed for big data analysis, with a focus on applications in engineering
• learn how to use high-performance computers and state-of-the-art open-source software (RHadoop) to retrieve, store and analyse big data