Notes and projects on hypothesis generation and testing for data science and analytics.
As data science and machine learning take on vital roles in solving complex business problems, it is important to understand the business's problem statement and to have proper knowledge of the business domain. Hypothesis generation and testing are the first steps a data scientist needs to consider in order to uncover new ideas and solve these problems.
In general, a statistical hypothesis is a statement (an assumption) about the nature of a population, often stated in terms of a population parameter. Statistical hypothesis testing is a class of methods used to answer questions about a population using the data under study, typically a sample drawn from that population. It is an analysis in which assumptions about a population parameter are put to the test. It helps data scientists decide whether to reject, or fail to reject, an assumption or claim (i.e. a statistical hypothesis) about the data, and it clarifies which questions the data can answer.
When a hypothesis is generated, an educated guess is made about the various factors that affect the business problem under study. The main difference between the two is that hypothesis generation is the process of making that educated guess, while hypothesis testing is the analysis of the guess to conclude whether the data support it, for example whether there is a statistically significant relationship between variables.
Below are some of the roles hypothesis generation plays in solving DS/ML problems:
- Comprehension of the business problem and factors affecting the target variable.
- Guidance on the data collection process i.e. clarity on which data sources are key in converting the business problem to a DS-based problem.
- More in-depth knowledge about the business domain.
When should you perform hypothesis generation?
- Hypothesis generation should be done prior to data collection (or before looking at the data). Thorough hypothesis generation helps ensure that all relevant variables, including those not yet present in the dataset, have been considered.
The Type 1 error rate, also called the significance level (alpha), is the probability of rejecting the null hypothesis given that it is true.
Type 1 error - a mistaken rejection of a null hypothesis that is actually true. Type 2 error - a failure to reject a null hypothesis that is actually false.
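A small simulation can make the Type 1 error rate concrete. The sketch below is illustrative only: it assumes a two-sided one-sample z-test with a known standard deviation, and the helper names `z_test_p_value` and `type1_error_rate` are hypothetical, not part of these notes. Every sample is drawn from a population where the null hypothesis is true, so each rejection is a Type 1 error, and the share of rejections should land close to alpha.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def type1_error_rate(alpha=0.05, trials=2000, n=30, seed=0):
    """Fraction of trials that reject a null hypothesis that is actually true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # The data are drawn with true mean 0, so H0: mu = 0 is true.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1  # rejecting a true H0 is a Type 1 error
    return rejections / trials


rate = type1_error_rate()
print(f"Estimated Type 1 error rate: {rate:.3f}")
```

Changing the true mean inside the loop to a non-zero value would turn the same loop into an estimate of how often the test correctly rejects a false null hypothesis; one minus that share is the Type 2 error rate.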
The basic concept of hypothesis testing is to consider the null hypothesis to be true until strong evidence is found against it. The smaller the significance level (alpha), the more evidence is required to reject the null hypothesis.
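The effect of alpha on the amount of evidence required can be seen in the rejection threshold. As a minimal sketch, assuming a two-sided z-test (the function name `critical_z` is hypothetical), the critical value grows as alpha shrinks, so the test statistic must fall further from zero before the null hypothesis is rejected.

```python
from statistics import NormalDist


def critical_z(alpha):
    """Two-sided critical value: reject H0 when |z| exceeds this threshold."""
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: reject H0 when |z| > {critical_z(alpha):.3f}")
```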
Steps to perform hypothesis testing:
1. Set the hypotheses (null and alternative).
2. Set the significance level/criteria for a decision.
3. Compute the test statistic.
4. Make a decision (based on the p-value).
Note that we usually test the null hypothesis because we suspect it is wrong. If the p-value is less than alpha, we reject the null hypothesis. If the p-value is greater than alpha, we fail to reject the null hypothesis.
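The four steps can be sketched end to end with a one-sample z-test. This is an illustration under stated assumptions, not a prescribed method: the sample values, the hypothesized mean of 50, and the known standard deviation of 3.0 are all made up for the example, and a t-test would be the usual choice when the standard deviation has to be estimated from the sample.

```python
from statistics import NormalDist, mean

# Hypothetical data: ten daily order values (made up for illustration).
sample = [52.1, 48.3, 55.0, 51.2, 49.8, 53.4, 50.9, 54.2, 47.6, 52.5]
sigma = 3.0  # assumed known population standard deviation

# Step 1: set the hypotheses. H0: mu = 50, H1: mu != 50 (two-sided).
mu0 = 50.0

# Step 2: set the significance level.
alpha = 0.05

# Step 3: compute the test statistic and its two-sided p-value.
z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 4: make a decision by comparing the p-value with alpha.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"z = {z:.3f}, p-value = {p_value:.3f}, decision: {decision}")
```

With these made-up numbers the p-value is above alpha, so the outcome is "fail to reject H0", which is not the same as showing that the null hypothesis is true.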