Streamlining RNA-Seq Data Processing with MATLAB
Written on
Chapter 1: Introduction to RNA-Seq Data Processing
In this section, I will share my experience using MATLAB for the pre-processing of RNA-Seq data utilizing the bowtie program. As a beginner in coding, this narrative reflects my journey rather than serving as a definitive guide. I encourage any advice on improving data handling or MATLAB practices.
Previously, I discussed the foundational concepts behind RNA-Seq analysis in two earlier articles. The objective of my project was to identify the transcriptional start site, which indicates the precise location where RNA synthesis begins within a gene.
To achieve this, I first purified the unique start of the RNA and attached a designated RNA sequence, known as a 5' RNA adapter, at the RNA's beginning. Using reverse transcriptase, a viral enzyme, I converted all RNA into complementary DNA (cDNA). I amplified the cDNA specific to my gene of interest using a specific primer, and subsequently sequenced the amplified cDNA to analyze the resulting sequence reads.
Chapter 2: Data Import and Initial Processing
The sequence data was provided to me as a FASTA file. I must admit that while the code I wrote initially produced the expected results, I did not instruct MATLAB to generate the correct output for the bowtie program, as discussed in my second article.
The code I present here processes the data in a manner compatible with bowtie.
Importing Files and Reverse Complementing Reads
My first task involved reading the fastq data files, which MATLAB simplified via a built-in function that imports the file into a structured format. Since I requested paired-end reads, I imported two files:
CopyFASTQStruct1 = fastqread('97–21-K_R1_001.fastq');
FASTQStruct2 = fastqread('97–21-K_R2_001.fastq');
To clarify paired-end reads, they allow for the sequencing of longer RNA or DNA fragments. The two imported files represent the two reads.
As read 2 is derived from the opposite end and direction of read 1, I needed to obtain the reverse complement of read 2 to align the sequences correctly. The following code accomplished this:
CopyFASTQStruct2RC = FASTQStruct2;
N2 = numel(FASTQStruct2);
for n = 1:N2;
FASTQStruct2RC(n).Sequence = seqrcomplement(FASTQStruct2RC(n).Sequence);
end;
Searching for Specific Sequences
Next, I aimed to locate my 5' adapter and gene-specific primer within these reads. To streamline the search, I merged the two sequences for alignment:
for n = 1:N2;
FASTQStruct2RC(n).MergeSeq = [FASTQStruct1(n).Sequence, FASTQStruct2RC(n).Sequence];
end;
Following that, I converted structure files into tables for easier management:
FASTQtable1 = struct2table(FASTQStruct1);
FASTQtable1.Properties.VariableNames = {'Header1', 'Sequence1', 'Quality1'};
FASTQtable2RC = struct2table(FASTQStruct2RC);
FASTQtable2RC.Properties.VariableNames = {'Header2', 'Sequence2', 'Quality2', 'MergeSeq'};
FASTQtableConcat = [FASTQtable1, FASTQtable2RC];
Now I will search for the specific sequences:
RNA5prAdapter = "GTTCAGAGTTCTACAGTCCGACGATC";
The MATLAB function contains searches each merged sequence for the sequences I was interested in and marks the results in a new column.
Orientation of Reads
To ensure uniformity, I wanted all my 5' adapters positioned to the left. If a reverse complement of the 5' adapter was detected, both reads were reversed:
FASTQOrientedStruct = table2struct(FASTQtableConcat);
N = height(FASTQtableConcat);
for n = 1:N;
if FASTQOrientedStruct(n).RNA5prAdapterRC == 1
L = FASTQOrientedStruct(n).Sequence1;
R = FASTQOrientedStruct(n).Sequence2;
FASTQOrientedStruct(n).Sequence1 = seqrcomplement(R);
FASTQOrientedStruct(n).Sequence2 = seqrcomplement(L);
end
end
FASTQOrientedTable = struct2table(FASTQOrientedStruct);
At each transformation phase, I performed checks to ensure the data was processed correctly.
Filtering Sequences
With the reads oriented, filtering became simpler. I retained sequences containing a 5' adapter on the left and a primer on the right:
FilteredSeqTable1 = FASTQOrientedTable(FASTQOrientedTable.RNA5prAdapter == 1,:);
SjFilteredSeqTable1 = FilteredSeqTable1(FilteredSeqTable1.SjNR226RC == 1,:);
SosFilteredSeqTable1 = FilteredSeqTable1(FilteredSeqTable1.SosMO484RC == 1,:);
Next, I removed any sequences not part of the RNA, editing out the 5' adapter using the extractAfter function.
Final Checks and Exporting Data
I performed final quality checks to ensure all RNA 5' adapters had been removed while confirming the presence of primers.
Subsequently, I converted the filtered data into FASTQ files, ensuring quality scores were aligned with the sequences:
fastqwrite('97_21_K_Sj.fq', SjStruct);
fastqwrite('97_21_K_Sos.fq', SosStruct);
These data files were then utilized in the bowtie alignment program, which I elaborated on in the second article.
Conclusion
In summary, we explored how I utilized MATLAB to preprocess RNA-Seq read data for analysis in a bowtie program. The steps included:
- Reading Genewiz FASTQ files
- Reverse complementing the second read pair
- Filtering for sequences containing both the 5' RNA adapter and gene-specific sequences
- Orienting the sequences for consistency
- Trimming excess data and saving as new FASTQ files.
For further insights into the experiment and data files, refer to the previous articles, although the coding specifics will not be discussed there.
Chapter 3: Analyzing RNA-Seq Data
The first video titled "How to analyze RNA-Seq data? Find differentially expressed genes in your research." provides an overview of methods for analyzing RNA-Seq data, focusing on identifying differentially expressed genes.
The second video, "How to Perform RNA-seq Analysis - YouTube," offers a comprehensive guide on conducting RNA-Seq analysis, perfect for beginners and experienced researchers alike.