The objective is to identify certain locations in a long sequence made up of four letters: a, t, g and c.
This online application does exactly what I am trying to do:
<[login to view URL]>
Just copy and paste this sequence into the program and check the output:
ctgaaccgtctcattcccaggcaggatcggtcctgtatttcaggtctaatttttctaattcattaggtcttgtaacctttcctttttaggggccccccaaccctaacctcaggtagtaatgtgtatccattccaggggctctatatccctctctccagtcttccactgccttggttcacagacggttctctccactcccgacagatcgggtgcttgttggatttcaggtgcctgtcccttcctatcccaggtcctgcccctcttctctttcagatccgctgccgcctccaccctagatgttgccccattgctgcttcaggacttttct
This problem will be solved by using classification: a Matlab program will be
trained on true and false data and learn some rules from it.
Once the program is trained and has learned the rules (these are called classifiers),
we can give it a new sequence to read and it will score that sequence accordingly.
Training input:
gaggtatgttcagtgagtcaggtattcctggtgagtgaggtgagc
Training output (the same as the input, but split into rows of 9 letters):
gaggtatgt
tcagtgagt
caggtattc
ctggtgagt
gaggtgagc
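As a rough illustration of this splitting step, here is a minimal sketch. It is written in Python only to show the idea (the deliverable itself is Matlab), it assumes every training example is a fixed window of 9 letters as in the rows above, and the name `split_rows` is hypothetical.

```python
def split_rows(sequence, width=9):
    """Cut a concatenated training string into consecutive rows of `width` letters."""
    if len(sequence) % width != 0:
        raise ValueError("sequence length is not a multiple of the row width")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


training_input = "gaggtatgttcagtgagtcaggtattcctggtgagtgaggtgagc"
rows = split_rows(training_input)
for row in rows:
    print(row)
```

In each of these rows the letters gt sit at the 4th and 5th positions, which is the layout the later scoring step would rely on.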
To get the idea, consider this analogy:
Suppose 123456 is correct and 214536 is wrong.
The selection criterion is 45.
If I give you 124456, is it true or false?
Just by looking, you can estimate the percentage chance that it is true.
If I give you 11111111112345633333333, the program must find the pattern 123456
and score it 100% correct because it is exactly the same as the training data.
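The analogy can be sketched roughly as follows, in Python for illustration only (not the Matlab deliverable). The scoring rule here is an assumption made for the sketch: a window counts as a candidate only if 45 sits at the same offset as in the true example, and its score is simply the fraction of positions that agree with 123456. A trained classifier would weight the positions differently. The names `similarity` and `scan` are hypothetical.

```python
TRUE_EXAMPLE = "123456"  # the known-correct pattern from the analogy
CRITERION = "45"         # the selection criterion
CRITERION_POS = TRUE_EXAMPLE.index(CRITERION)  # 0-based offset of "45" in the true example


def similarity(window):
    """Fraction of positions where `window` agrees with the true example (assumed rule)."""
    matches = sum(a == b for a, b in zip(window, TRUE_EXAMPLE))
    return matches / len(TRUE_EXAMPLE)


def scan(text):
    """Return (start, window, score) for every window with "45" at the expected offset."""
    n = len(TRUE_EXAMPLE)
    hits = []
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        if window[CRITERION_POS:CRITERION_POS + len(CRITERION)] == CRITERION:
            hits.append((start, window, similarity(window)))
    return hits


hits = scan("11111111112345633333333")
for start, window, score in hits:
    print(start, window, f"{100 * score:.0f}%")
```

Under this assumed rule, 124456 agrees with 123456 in five of six positions, while 214536 is never a candidate because its 45 is in the wrong place.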
In my project you have a lot of true examples and a lot of false examples.
The program will be trained using these examples and learn the rules for selection (classifiers). When given a new sequence, it will locate every ag and gt and show, for each one, the percentage chance that it is true.
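One possible shape for this train-then-score workflow is sketched below, in Python for illustration only (not the Matlab deliverable). It swaps in a specific technique that this posting does not prescribe: a position-by-position naive Bayes model with equal priors for true and false. Everything about it is an assumption, including the 9-letter window with gt at offset 3 (taken from the training rows above) and all function names. The false windows are invented placeholders, not real data, and ag would need its own model and window layout.

```python
import math

ALPHABET = "acgt"  # assumes lowercase input containing only these four letters


def column_frequencies(windows, pseudocount=1.0):
    """Per-position letter frequencies, smoothed so no frequency is zero."""
    freqs = []
    for pos in range(len(windows[0])):
        counts = {letter: pseudocount for letter in ALPHABET}
        for w in windows:
            counts[w[pos]] += 1
        total = sum(counts.values())
        freqs.append({letter: c / total for letter, c in counts.items()})
    return freqs


def train(true_windows, false_windows):
    """Per-position log-odds weights: log(P(letter | true) / P(letter | false))."""
    t = column_frequencies(true_windows)
    f = column_frequencies(false_windows)
    return [{l: math.log(t[i][l] / f[i][l]) for l in ALPHABET} for i in range(len(t))]


def percent_true(window, weights):
    """Posterior probability of 'true' under equal priors, as a percentage."""
    log_odds = sum(weights[i][letter] for i, letter in enumerate(window))
    return 100.0 / (1.0 + math.exp(-log_odds))


def scan(sequence, motif, offset, weights):
    """Score every occurrence of `motif` that has a complete window around it."""
    width = len(weights)
    results = []
    pos = sequence.find(motif)
    while pos != -1:
        start = pos - offset
        if start >= 0 and start + width <= len(sequence):
            window = sequence[start:start + width]
            results.append((pos, window, percent_true(window, weights)))
        pos = sequence.find(motif, pos + 1)
    return results


true_windows = ["gaggtatgt", "tcagtgagt", "caggtattc", "ctggtgagt", "gaggtgagc"]
# PLACEHOLDER false windows, invented for this sketch; the real ones are in the zip file.
false_windows = ["tttgtcccc", "acagtttta", "cccgtaaac", "tatgtcttt", "aaagtccca"]

weights = train(true_windows, false_windows)
new_sequence = "ctgaaccgtctcattcccaggcaggatcggtcctgtatttcaggtctaat"
for pos, window, pct in scan(new_sequence, "gt", 3, weights):
    print(pos, window, f"{pct:.1f}%")
```

With only five examples per class the percentages mean very little; the point is the flow of training on true and false windows, then scanning a new sequence and reporting a percentage at each gt.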
The zip file includes a more detailed description and the data.
## Deliverables
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
## Platform
Matlab