Hi, I tried searching the forums but could not find anything on this topic.
What I want to do is simple: given two indices i and j, return the substring that lies between them.
My sample input.txt file is:
---------------------------------------------------------------------
>chromosome_1..............1,000,000-2,000,000
AGGCATAGAATTAGACATAAGGTCTCTTCT
AGAAGGGGTTATCCTCAGGCTTGAGGCAT
---------------------------------------------------------------------
Assume that each of the 3 lines in the above file is 30 characters
long. Except for the first (header) line, which starts with a ">", all
other lines consist of characters randomly chosen from the character set
{A,G,C,T}.
So if I ask for substring (40,70), I should get back only the
underlined characters below, as a single line, i.e. without the \n
at the end of the second line.
---------------------------------------------------------------------
>chromosome_1..............1,000,000-2,000,000
AGGCATAGAATTAGACATAAGGTCTCTTCT
AGAAGGGGTTATCCTCAGGCTTGAGGCAT
---------------------------------------------------------------------
On the other hand, if I ask for substring 70-200, I should get
just the last 20 characters, and if I ask for substring
300-400, I should get nothing.
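The expected behaviour described above can be pinned down with a small sketch. It is written in Python purely for illustration (not one of the tools mentioned in this post), and it rests on assumptions: the header line counts toward the indices, newlines do not, i is a 0-based start and j is an exclusive end. The function name substring_between is made up for this example.

```python
import sys


def substring_between(text, i, j):
    """Return characters i..j-1 of text, ignoring line breaks.

    Assumption: indices are 0-based, j is exclusive, and the header
    line is counted like any other line.
    """
    # Join all lines so that newlines never count toward the indices.
    joined = "".join(text.splitlines())
    # Slicing clamps out-of-range indices, so a range that starts
    # past the end of the data yields an empty string.
    return joined[i:j]


if __name__ == "__main__":
    # Hypothetical usage: python substr.py input.txt 40 70
    path, i, j = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
    with open(path) as handle:
        print(substring_between(handle.read(), i, j))
```

With three lines of exactly 30 characters each, (40,70) would return the last 20 characters of the second line followed by the first 10 of the third, (70,200) would return the last 20 characters of the file, and (300,400) would return an empty string.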
I am OK using perl -pe, grep, or egrep for this task; however, I have
not been able to construct the right expression for it.
I have tried things along the lines of extracting the first match only
and then using a backreference to exclude it from the string. But I am
not conceptually clear, and the patterns are not working. Also, I think
it could be done in a simpler way.
Any help would be greatly appreciated.
PS: I am a newbie, so kindly provide any pointers or a few sentences of explanation about the solution.