Identify non-coding regions from a genome annotation
I have this GTF file and I use the command below on a Linux machine to extract the coding regions of the genome:
awk '{if($3=="transcript" && $20=="\"protein_coding\";"){print $0}}' gencode.gtf
How could I do the inverse and keep only the non-coding regions?
annotation genome gtf text-processing interval
edited 10 hours ago by Daniel Standage
asked 16 hours ago by Feresh Teh
Do you want all non-coding regions of the genome or do you want all non-coding transcripts? These are two very different things.
– terdon♦, 10 hours ago
3 Answers
If you want all transcripts from that GTF file whose type isn't "protein_coding", you can use almost the same command; just change the == ("is") to != ("isn't"):
awk '{if($3=="transcript" && $20!="\"protein_coding\";"){print $0}}' gencode.gtf
Or, a simpler version:
awk '$3=="transcript" && $20!="\"protein_coding\";"' gencode.gtf
Note that this will not include any of the havana transcripts in the file, but I am assuming that's what you want since that's what your original command did.
Specifically, the command will return the following types of transcript (the numbers on the left are the number of such transcripts in the file):
awk '$3=="transcript" && $20!="\"protein_coding\";"{print $20}' gencode.gtf | sort | uniq -c | sort -nk1
1 "translated_processed_pseudogene";
2 "Mt_rRNA";
3 "IG_J_pseudogene";
3 "TR_D_gene";
4 "TR_J_pseudogene";
5 "TR_C_gene";
10 "IG_C_pseudogene";
18 "IG_C_gene";
18 "IG_J_gene";
22 "Mt_tRNA";
25 "3prime_overlapping_ncrna";
27 "TR_V_pseudogene";
37 "IG_D_gene";
58 "non_stop_decay";
59 "polymorphic_pseudogene";
74 "TR_J_gene";
97 "TR_V_gene";
144 "IG_V_gene";
182 "unitary_pseudogene";
196 "IG_V_pseudogene";
330 "sense_overlapping";
387 "pseudogene";
442 "transcribed_processed_pseudogene";
531 "rRNA";
802 "sense_intronic";
860 "transcribed_unprocessed_pseudogene";
1529 "snoRNA";
1923 "snRNA";
2050 "misc_RNA";
2549 "unprocessed_pseudogene";
3116 "miRNA";
9710 "antisense";
10623 "processed_pseudogene";
11780 "lincRNA";
13052 "nonsense_mediated_decay";
25955 "retained_intron";
28082 "processed_transcript";
You might also want to remove that "translated_processed_pseudogene" since that is actually translated into protein and is therefore technically coding:
awk '$3=="transcript" &&
     $20!="\"protein_coding\";" &&
     $20!="\"translated_processed_pseudogene\";"' gencode.gtf
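The commands above rely on the transcript type being the 20th whitespace-separated field, which depends on which attributes come before it and may not hold for other annotation releases. A minimal sketch that looks the value up by key instead, assuming the attribute is named transcript_type as in this file. The file sample.gtf and its two records are made up for illustration and are not lines from the real annotation:

```shell
# Build a tiny, made-up GTF with one coding and one non-coding transcript.
printf 'chr1\tHAVANA\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_type "protein_coding"; gene_name "A"; transcript_type "protein_coding";\n' > sample.gtf
printf 'chr1\tENSEMBL\ttranscript\t300\t400\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; gene_type "lincRNA"; gene_name "B"; transcript_type "lincRNA";\n' >> sample.gtf

# Split on tabs so $3 is the feature type and $9 the whole attribute string,
# then keep transcripts whose attributes do not declare the protein_coding type.
noncoding=$(awk -F'\t' '$3=="transcript" && $9 !~ /transcript_type "protein_coding";/' sample.gtf)
printf '%s\n' "$noncoding"
```

Only the T2 record should be printed. Splitting on tabs also keeps the test independent of how many attributes a given line carries.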
edited 8 hours ago
answered 10 hours ago by terdon♦
Thanks a lot, you really saved me; I could not solve that myself. How can I extract the information below from each line of the resulting non-coding file?
chr1 29553 30039 ENSG00000243485.2 + gene_name "MIR1302-11"
– Feresh Teh, 8 hours ago
@FereshTeh you're welcome. I think you want
awk '$3=="transcript" && $20!="\"protein_coding\";" && $20!="\"translated_processed_pseudogene\";"{print $1,$4,$5,$10,$7}' gencode.gtf
but, if not, please ask a new question about that.
– terdon♦, 8 hours ago
Thanks a lot. That returns everything except the gene name; this is the output:
chr1 29554 31097 "ENSG00000243485.2"; +
– Feresh Teh, 8 hours ago
@FereshTeh please ask a new question so you can show exactly what output you need.
– terdon♦, 8 hours ago
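For the gene-name follow-up in the comments above, here is a hedged sketch that pulls gene_id and gene_name out of the attribute column by key and strips the quotes and semicolon. The one-line demo.gtf is invented for illustration: its coordinates and identifiers echo the comments, but the transcript ID and the lincRNA types are placeholders. The helper name attr is also mine, not part of any tool.

```shell
# One made-up record in a GENCODE-like layout (tab-separated, attributes in column 9).
printf 'chr1\tENSEMBL\ttranscript\t29554\t31097\t.\t+\t.\tgene_id "ENSG00000243485.2"; transcript_id "T1"; gene_type "lincRNA"; gene_name "MIR1302-11"; transcript_type "lincRNA";\n' > demo.gtf

summary=$(awk -F'\t' '
  # attr(s, key): return the unquoted value of key in attribute string s, or "" if absent.
  function attr(s, key,    re, v) {
    re = key " \"[^\"]*\""
    if (match(s, re) == 0) return ""
    v = substr(s, RSTART, RLENGTH)
    sub(/^[^"]*"/, "", v)   # drop everything up to the opening quote
    sub(/"$/, "", v)        # drop the closing quote
    return v
  }
  # Same filter as before, but keyed on the attribute name instead of field 20.
  $3 == "transcript" && $9 !~ /transcript_type "protein_coding";/ {
    print $1, $4, $5, attr($9, "gene_id"), $7, attr($9, "gene_name")
  }' demo.gtf)
printf '%s\n' "$summary"
```

For the demo record this prints chr1 29554 31097 ENSG00000243485.2 + MIR1302-11. The key match is a plain substring search, so a key whose name ends in gene_id or gene_name would also match; that seemed acceptable for a sketch.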
If you want the non-coding regions of a protein-coding transcript, it sounds like you are looking for UTRs.
UTR has its own feature type in the GTF file, so you can do this:
$ awk -v FS="\t" '$3=="UTR"' gencode.gtf
If the GTF file is compressed, use this instead:
$ zcat gencode.gtf.gz | awk -v FS="\t" '$3=="UTR"'
BTW: Why are you using such an old release of GENCODE? The current version is v29.
edited 16 hours ago
answered 16 hours ago by finswimmer
Sorry, what I literally need is the non-coding regions of the human genome; I only referred to the coding parts in order to ask my question here.
– Feresh Teh, 16 hours ago
Sorry, I tried that but my output is empty.
– Feresh Teh, 16 hours ago
As @Wouter tells you, the non-coding region of a genome is the complement of the coding regions. Coding regions have their own feature in the GTF file. You can get them with
$ awk -v FS="\t" '$3=="CDS"' gencode.gtf
Reading the manual for bedtools complement is your task.
– finswimmer, 16 hours ago
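To make the complement idea concrete, here is a rough sketch in plain sort and awk rather than bedtools (it is a stand-in, not what bedtools does internally). It merges overlapping CDS intervals per chromosome and prints the gaps between them. It does not report the stretch before the first or after the last CDS on a chromosome, since that needs chromosome lengths, which bedtools complement takes from a genome file. The file cds_demo.gtf and its records are made up for illustration.

```shell
# Made-up records on one chromosome; two of the CDS rows overlap.
printf 'chr1\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "G1";\n' > cds_demo.gtf
printf 'chr1\tHAVANA\tCDS\t150\t300\t.\t+\t0\tgene_id "G1";\n' >> cds_demo.gtf
printf 'chr1\tHAVANA\tCDS\t500\t600\t.\t-\t0\tgene_id "G2";\n' >> cds_demo.gtf
printf 'chr1\tHAVANA\texon\t50\t700\t.\t-\t.\tgene_id "G2";\n' >> cds_demo.gtf

tab=$(printf '\t')

# Keep CDS rows, sort by chromosome then start, and print the 1-based,
# inclusive gaps between consecutive merged CDS intervals.
gaps=$(awk -F'\t' '$3=="CDS"' cds_demo.gtf | sort -t "$tab" -k1,1 -k4,4n |
  awk -F'\t' -v OFS='\t' '
    $1 != chrom { chrom = $1; end = $5; next }       # first CDS on a new chromosome
    $4 > end + 1 { print chrom, end + 1, $4 - 1 }    # uncovered stretch before this CDS
    $5 > end { end = $5 }                            # extend the merged interval
  ')
printf '%s\n' "$gaps"
```

With the demo data the merged CDS intervals are 100-300 and 500-600, so the only gap printed is chr1 301 499. Note that these gaps mix introns, UTRs and intergenic sequence; which of those you want is a separate decision.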
Sorry, but your commands return nothing; I mean they are not working and return an empty file.
– Feresh Teh, 10 hours ago
The GTF file the OP has linked to includes non-coding transcripts (LINCs, pseudogenes, tRNAs etc). I am guessing this is what they're after.
– terdon♦, 10 hours ago
This isn't a problem that's easily solved with awk. It's not like you're extracting a feature that's annotated in the GTF file. Instead, you want the empty space between annotated features.
A few years ago I wrote a program called LocusPocus for a similar task. It uses a gene annotation to break down a genome into gene loci and intergenic regions. It handles overlapping annotations and other weirdness pretty robustly. The output will include both coding regions and non-coding regions, but you can identify the intergenic spaces as those with iLocus_type
equal to iiLocus
or fiLocus
.
Note: the --delta
parameter will extend each gene/transcript by 500bp by default.
Caveat: the program only accepts GFF3 input by default. Hopefully it won't be too hard to convert your GTF to GFF3.
Another caveat: eventual interpretation of these data will depend on what features are annotated in the genome and which annotations you include vs ignore. Do you want your non-coding regions to include non-coding genes, or should these be treated separately? Some non-coding regions will be full of transposable elements and other repetitive DNA, while others will have enhancers, promoters, or other regulatory elements. It's important to tread carefully before you jump to any conclusions.
edited 10 hours ago
answered 10 hours ago
Daniel Standage
2,303329