Universitat Internacional de Catalunya

MODULE 3.1: Statistical Methods and Data Mining

5
13945
1
First semester
OB
Main language of instruction: Catalan

Other languages of instruction: English, Spanish

Teaching staff

Introduction

Statistics is a fundamental pillar of big data science (also known as "Data Science", from the English term). It is the tool that gives data professionals the ability to make sense of huge amounts of numerical information, so that they can draw conclusions and make decisions based on it. As an essential part of the scientific method, it is the discipline that puts the "science" in "Data Science".
This course covers the fundamental principles of classical and modern statistics, with special emphasis on the underlying mathematical theory. It can therefore be regarded as a regular mathematics course, with a good dose of theory, problems and practical work. A substantial part of the course is nevertheless devoted to statistical programming, built around the usual scientific Python libraries (NumPy, SciPy, pandas and the like).

Pre-course requirements

Basic notions of mathematics (secondary-school level: ESO / Bachillerato) and familiarity with at least one programming language.

Objectives

  • Know how to reason mathematically and apply the scientific method, as well as understand its importance in making decisions based on data.
  • Assimilate the basic concepts of probability theory.
  • Understand and correctly apply the concept of statistical significance. Know how to identify what constitutes statistical evidence.
  • Be able to use software and programming languages to perform statistical analysis on a set of data.
  • Understand and know how to apply statistical simulation algorithms.

Learning outcomes of the subject

The student must be able to draw up an implementation plan for an information system (IS) at an example company, as a case study. The student must also be able to detail the information systems plans at a high level and to understand, in a negotiation, which evaluation criteria should be applied to prioritize the steps of this deployment plan.

Syllabus

Topic 0: Introduction to fundamental concepts of mathematics
0.1 Numbers and operations
0.2 Basic Mathematical Analysis
0.3 Derivatives and integrals
0.4 Python: Introduction and Fundamental Data Structures

Topic 1: Fundamentals of probability
1.1 Why do we use statistics?
1.2 Kolmogorov axioms
1.3 Calculation of probabilities: Laplace's rule, conditional probability, Bayes' formula
1.4 Discrete random variables: Bernoulli, Binomial, Poisson
1.5 Absolutely continuous random variables: Uniform distribution, Normal distribution
1.6 Mathematical expectation
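
As an illustration of item 1.3, the following is a minimal sketch of Bayes' formula in plain Python. The function name, its parameter names and the numbers used are hypothetical, chosen only for this example; they are not taken from the course materials.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) given P(H), P(E | H) and P(E | not H).

    All three argument names are illustrative assumptions.
    """
    # Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|not H)P(not H)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    # Bayes' formula: P(H|E) = P(E|H)P(H) / P(E)
    return likelihood * prior / evidence


# Hypothetical numbers: a condition with 1% prevalence, a test that detects
# it 95% of the time and gives a false positive 5% of the time.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive test) = {posterior:.3f}")
```

With these made-up numbers the posterior probability is only about 0.16, which shows how a low prior can dominate an apparently accurate test.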

Topic 2: Parameter estimation
2.1 Introduction and definitions
2.2 Unbiased estimators
2.3 Point estimation
2.4 Method of moments and maximum likelihood
2.5 Statistical significance
2.6 Estimation by confidence intervals

Topic 3: Hypothesis testing
3.1 Fundamental concepts: null hypothesis and p-value
3.2 Fisher's exact test
3.3 Parametric tests: means, variances and proportions
3.4 Nonparametric tests: comparison of distributions

Topic 4: Monte Carlo simulation
4.1 The Central Limit Theorem
4.2 Lack of normality. Shapiro-Wilk and Kolmogorov-Smirnov tests
4.3 Bootstrap
4.4 Permutation tests
4.5 Tests for more than two samples
4.6 Approximation of the p-value
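
As an illustration of items 4.4 and 4.6, the following is a minimal sketch of a two-sample permutation test whose p-value is approximated by Monte Carlo resampling with NumPy. The function name, the choice of test statistic (absolute difference of means), the default number of resamples and the sample data are all assumptions made for this example, not the course's own material.

```python
import numpy as np


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Approximate the two-sided p-value for a difference in means.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are exchangeable,
        # so we shuffle the pooled data and split it again.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            count += 1

    # The +1 terms count the observed labelling itself, so the
    # estimated p-value is never exactly zero.
    return (count + 1) / (n_resamples + 1)


# Made-up samples, for demonstration only.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
group_b = [4.2, 4.8, 4.4, 4.1, 4.9, 4.5]
print(permutation_p_value(group_a, group_b))
```

The estimate becomes more precise as n_resamples grows, at the cost of more computation. It only approximates the exact p-value that would be obtained by enumerating every possible relabelling.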

Teaching and learning activities

In person



The first four classes each consist of a theoretical part (approx. 60% of the time) and a problem-solving part (approx. 40%). The fifth class is a practical lab session in which students work on the problems from the four course deliverables.

Evaluation systems and criteria

In person



The final grade is obtained as the average of the marks of the four practical assignments (one per topic, except Topic 0). These assignments are individual and are intended to be solved autonomously, although there is no problem in asking classmates or teachers for advice or help.

If the final grade does not reach the pass mark, the course can be recovered with a final practical exam.

Bibliography and resources

Main bibliography

  • Business statistics. Simple and very clear examples: https://goo.gl/aUD4be
  • Biostatistics (Rius & Wärnberg). Very complete, it offers an extensive catalog of hypothesis tests, although it is widely applied to biology: https://goo.gl/n9NHR2

Further reading

  • Probabilitats (Marta Sanz). Theoretical introduction to mathematical probability. It is a very dense and not very readable book, but of great value as a go-to reference: http://www.publicacions.ub.edu/ficha.aspx?cod=04980e
  • Statistics (Fortiana / Nualart). Statistical counterpart of the previous reference. In Catalan: http://www.publicacions.ub.edu/ficha.aspx?cod=04967e
  • Linear Models with R: Statistical methods aimed at the construction of linear models using R. Available online: http://www.utstat.toronto.edu/~brunner/books/LinearModelsWithR.pdf
  • Introducció a l'Anàlisi Matemàtica (Joaquim M. Ortega Aramburu, UAB Publications, 2002). Numbers, sequences, functions, series, derivatives and integrals. From scratch and with all the necessary formalities. In Catalan.
  • Basic Numerical Calculation Tools (Aubanell, Benseny, Delshams; Labor, 1993). Introductory manual to numerical calculation, with descriptions and analysis of all the classical algorithms.
  • Calculus I and II (Apostol, Ed. Reverté 1989). Very dense manual that starts from scratch and ends in surface integrals. Good reference material for its completeness and precision.