1. Title of Database: E. coli promoter gene sequences (DNA)
with associated imperfect domain theory
2. Sources:
   (a) Creators:
       - promoter instances: C. Harley (CHARLEY@McMaster.CA) and R. Reynolds
       - non-promoter instances and domain theory: M. Noordewier
         -- (non-promoters derived from work of lab of Prof. Tom Record,
            University of Wisconsin Biochemistry Department)
   (b) Donor: M. Noordewier and J. Shavlik, {noordewi,shavlik}@cs.wisc.edu
   (c) Date received: 6/30/90
3. Past Usage:
   (a) biological:
       -- Harley, C. and Reynolds, R. 1987.
          "Analysis of E. Coli Promoter Sequences."
          Nucleic Acids Research, 15:2343-2361.
       machine learning:
       -- Towell, G., Shavlik, J. and Noordewier, M. 1990.
          "Refinement of Approximate Domain Theories by Knowledge-Based
          Artificial Neural Networks." In Proceedings of the Eighth National
          Conference on Artificial Intelligence (AAAI-90).
   (b) attributes predicted: member/non-member of class of sequences with
       biological promoter activity (promoters initiate the process of gene
       expression).
   (c) Results of study indicated that machine learning techniques (neural
       networks, nearest neighbor, contributors' KBANN system) performed as
       well as or better than classification based on canonical pattern
       matching (the method used in the biological literature).
4. Relevant Information Paragraph:
   This dataset was developed to help evaluate a "hybrid" learning
   algorithm ("KBANN") that uses examples to inductively refine preexisting
   knowledge. Under a "leave-one-out" methodology, the ML algorithms below
   made the following numbers of errors on the 106 instances. (See Towell,
   Shavlik, & Noordewier, 1990, for details.)
     System      Errors   Comments
     ------      ------   --------
     KBANN        4/106   a hybrid ML system
     BP           8/106   std backprop with one hidden layer
     O'Neill     12/106   ad hoc technique from the bio. lit.
     Near-Neigh  13/106   a nearest-neighbor algo (k=3)
     ID3         19/106   Quinlan's decision-tree builder
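
The leave-one-out procedure behind these counts can be sketched as follows. This is a minimal illustration, not the code used by Towell, Shavlik, and Noordewier: the Hamming distance, the function names, and the toy sequences are all assumptions made for the example. Only the value k=3 comes from the table above.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=3):
    """Majority label among the k training sequences closest to query.

    train is a list of (label, sequence) pairs. The distance metric is an
    assumption; the source does not say which one was used.
    """
    nearest = sorted(train, key=lambda item: hamming(item[1], query))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]

def leave_one_out_errors(data, k=3):
    """Hold out each instance once, predict it from the rest, count misses."""
    errors = 0
    for i, (label, seq) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, seq, k) != label:
            errors += 1
    return errors

# Toy sequences, made up for illustration (not taken from the dataset).
toy = [
    ("+", "tataat"), ("+", "tatact"), ("+", "tataaa"),
    ("-", "gcgcgc"), ("-", "gcgcga"), ("-", "gcccgc"),
]
print(f"{leave_one_out_errors(toy)}/{len(toy)} errors")
```

On the real data the same loop would run 106 times, each time training on the other 105 instances, which is how a count such as 13/106 arises.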
   Type of domain: non-numeric, nominal (one of A, G, T, C)
   -- Note: DNA nucleotides can be grouped into a hierarchy, as shown below:

                  X (any)
                /   \
     (purine) R       Y (pyrimidine)
             / \     / \
            A   G   T   C
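
One possible way to encode this hierarchy is a table of parent links. The sketch below is illustrative only: the names `PARENT`, `ancestors`, and `lowest_common_group` are hypothetical and do not come from the dataset or the KBANN work.

```python
# Parent links for the hierarchy above: base -> group -> X (any).
PARENT = {"A": "R", "G": "R", "T": "Y", "C": "Y", "R": "X", "Y": "X"}

def ancestors(symbol):
    """Return the chain from a symbol up to the root X, inclusive."""
    symbol = symbol.upper()
    if symbol != "X" and symbol not in PARENT:
        raise ValueError(f"unknown symbol: {symbol!r}")
    chain = [symbol]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_group(a, b):
    """Most specific symbol in the hierarchy that covers both inputs."""
    covers_b = set(ancestors(b))
    for node in ancestors(a):
        if node in covers_b:
            return node
    raise AssertionError("unreachable: X covers every symbol")

print(ancestors("a"))                  # chain for adenine
print(lowest_common_group("a", "g"))   # both purines
print(lowest_common_group("a", "t"))   # purine vs. pyrimidine
```

A learner could use such a lookup to generalize a position from a single base to "any purine" or "any pyrimidine" before falling back to "any".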
5. Number of Instances: 106
6. Number of Attributes: 59
   -- class (positive or negative)
   -- instance name
   -- 57 sequential nucleotide ("base-pair") positions
7. Attribute information:
   -- Statistics for numeric domains: No numeric features used.
   -- Statistics for non-numeric domains
      -- Frequencies:   Promoters   Non-Promoters
                        ---------   -------------
                   A      27.7%         24.4%
                   G      20.0%         25.4%
                   T      30.2%         26.5%
                   C      22.1%         23.7%

   Attribute #:  Description:
   ============  ============
              1  One of {+/-}, indicating the class ("+" = promoter).
              2  The instance name (non-promoters named by position in the
                 1500-long nucleotide sequence provided by T. Record).
           3-59  The remaining 57 fields are the sequence, starting at
                 position -50 (p-50) and ending at position +7 (p7). Each of
                 these fields is filled by one of {a, g, t, c}.
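
The attribute layout above can be read with a short parser such as the sketch below. It rests on assumptions that this file does not state: that each record is one comma-separated line, and that the 57 labels run p-50 .. p-1, p1 .. p7 with no position 0 (which is what makes -50 through +7 come to 57 fields). The function names and the example record are hypothetical.

```python
from collections import Counter

SEQ_LEN = 57
BASES = "agtc"

def position_labels():
    """Labels p-50 .. p-1, p1 .. p7 (57 labels; position 0 is skipped)."""
    return [f"p{i}" for i in range(-50, 8) if i != 0]

def parse_record(line):
    """Split one record into (class, name, {position label: base}).

    Accepts the sequence either as one 57-character field or as 57
    single-character fields; both layouts are assumptions.
    """
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < 3:
        raise ValueError("expected class, name, and sequence fields")
    cls, name, *rest = fields
    seq = "".join(rest).lower()
    if cls not in ("+", "-"):
        raise ValueError(f"class must be '+' or '-', got {cls!r}")
    if len(seq) != SEQ_LEN or set(seq) - set(BASES):
        raise ValueError("sequence must be 57 characters from {a, g, t, c}")
    return cls, name, dict(zip(position_labels(), seq))

def base_frequencies(sequences):
    """Percentage of each base over all positions of the given sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts[b] for b in BASES)
    if total == 0:
        raise ValueError("no bases to count")
    return {b: 100.0 * counts[b] / total for b in BASES}

# Made-up record for illustration; not an instance from the dataset.
example = "+,EXAMPLE," + "ac" * 28 + "g"
cls, name, attrs = parse_record(example)
print(cls, name, attrs["p-50"], attrs["p7"])
```

Applying `base_frequencies` separately to the promoter and non-promoter sequences would give a table in the same form as the frequencies listed above.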
8. Missing Attribute Values: none
9. Class Distribution: 50% (53 positive instances, 53 negative instances)