mustaphajola/hypo_testing


Hypothesis Generation and Testing For Data Science Projects.

Notes and projects on hypothesis generation and testing for data science and analytics.

As data science and machine learning take on vital roles in solving complex business problems, it is important to understand the business's problem statement and to know the business domain well. Hypothesis generation and testing are the first steps a data scientist needs to consider in order to surface new ideas and solve these problems.

Intro, Definition and Reason

In general, a statistical hypothesis is a statement (an assumption) about the nature of a population, often stated in terms of a population parameter. Statistical hypothesis testing is a class of methods used to answer questions about a population from the data under study, typically a sample. It is an analysis in which assumptions about a population parameter are put to the test. It helps data scientists reject, or fail to reject, an assumption or claim (i.e. a statistical hypothesis) about the data, and it clarifies which questions the data can answer.
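
As a purely illustrative example (the scenario and the symbols are assumptions, not taken from these notes), a claim that the mean order value $\mu$ of a store equals some reference value $\mu_0$ could be written as a pair of hypotheses, where $H_0$ is the null hypothesis and $H_1$ is the alternative:

```latex
H_0 : \mu = \mu_0 \qquad \text{versus} \qquad H_1 : \mu \neq \mu_0
```

Both statements are about the population parameter $\mu$, not about any particular sample drawn from that population.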

Hypothesis generation

When a hypothesis is generated, an educated guess is made about the various factors that affect the business problem under study. The main difference between the two is that hypothesis generation is the process of making an educated guess, while hypothesis testing is the analysis of that guess to conclude whether the data support it, for example whether a relationship between variables is statistically significant.

Below are some of the roles hypothesis generation plays in solving DS/ML problems:

  • Comprehension of the business problem and factors affecting the target variable.
  • Guidance on the data collection process i.e. clarity on which data sources are key in converting the business problem to a DS-based problem.
  • More in-depth knowledge about the business domain.

When should you perform hypothesis generation?

  • Hypothesis generation should be done before data collection (or before looking at the data). Thorough hypothesis generation helps ensure that all relevant variables, including ones not present in the dataset, have been considered.

Hypothesis testing

The type 1 error rate, also called the significance level (alpha), is the probability of rejecting the null hypothesis given that it is true.

  • Type 1 error: a mistaken rejection of a null hypothesis that is actually true.
  • Type 2 error: a failure to reject a null hypothesis that is actually false.
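
The two error types can be made concrete with a small simulation. The sketch below is not part of the original notes: it assumes a two-sided one-sample z-test with a known standard deviation, and the sample size, effect size (0.3), trial count, and helper names (`z_test_p_value`, `rejection_rate`) are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, fmean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_mean, mu0=0.0, sigma=1.0, n=30, alpha=0.05,
                   trials=5000, seed=0):
    """Fraction of simulated samples for which H0: mean == mu0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if z_test_p_value(sample, mu0, sigma) < alpha:
            rejections += 1
    return rejections / trials


# H0 is true here (true mean equals mu0), so every rejection is a type 1 error.
type_1_rate = rejection_rate(true_mean=0.0)

# H0 is false here (true mean is 0.3), so every non-rejection is a type 2 error.
type_2_rate = 1 - rejection_rate(true_mean=0.3)

print(f"type 1 error rate: {type_1_rate:.3f}")
print(f"type 2 error rate: {type_2_rate:.3f}")
```

Under these assumptions the simulated type 1 error rate should land close to the chosen alpha of 0.05, while the type 2 error rate depends on the sample size and on how far the true mean is from the hypothesized one.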

The basic idea of hypothesis testing is to treat the null hypothesis as true until strong evidence is found against it. The smaller the significance level (alpha), the more evidence is required to reject the null hypothesis.
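
One way to see how a smaller alpha demands more evidence is to look at the rejection threshold it implies. The sketch below assumes a two-sided z-test (a choice made here for illustration, not something the notes prescribe) and uses a hypothetical helper, `critical_z`, built on the standard library's `NormalDist`.

```python
from statistics import NormalDist


def critical_z(alpha):
    """Two-sided critical value: H0 is rejected when |z| exceeds this."""
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f} -> reject when |z| > {critical_z(alpha):.3f}")
```

As alpha shrinks from 0.10 to 0.01, the threshold grows from roughly 1.645 to roughly 2.576, so the observed test statistic has to sit further from what the null hypothesis predicts before the null is rejected.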

Steps to perform hypothesis testing:

  1. Set the hypotheses.
  2. Set the significance level, i.e. the criterion for a decision.
  3. Compute the test statistic.
  4. Make a decision (based on the p-value).

Note that we test the null hypothesis because we suspect it is wrong. If the p-value is less than alpha, we reject the null hypothesis; otherwise, we fail to reject it.
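
The four steps and the decision rule can be sketched end to end. Everything specific in this example is an assumption made for illustration: the data are made up, the test is a two-sided one-sample z-test, and the population standard deviation is treated as known. In practice the right test depends on the data and the question.

```python
from statistics import NormalDist, fmean

# Hypothetical data: page load times in seconds (made up for illustration).
sample = [3.1, 2.9, 3.4, 3.3, 3.0, 3.6, 3.2, 3.5, 3.1, 3.4]

# Step 1: set the hypotheses. H0: mean == 3.0, H1: mean != 3.0.
mu0 = 3.0
sigma = 0.3  # population standard deviation, assumed known for a z-test

# Step 2: set the significance level.
alpha = 0.05

# Step 3: compute the test statistic and its two-sided p-value.
z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 4: make a decision based on the p-value.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"z = {z:.3f}, p-value = {p_value:.4f}, decision: {decision}")
```

With these made-up numbers the sample mean is 3.25, the p-value falls below 0.05, and the null hypothesis is rejected. Failing to reject would not prove the null hypothesis true; it would only mean the sample did not provide strong enough evidence against it.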
