Dataset overview, statistics, download, and structure

  Dataset overview

BC: Base Calling;   PD: PolyA Detection;   SA: Segmentation and Event Alignment;   MD: Modification Detection.
| Dataset | Publish Date | Accession Number | Species | Type | Sample | Flowcell_type | Sequencing_kit | BC | PD | SA | MD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ont_ployA_standard | 2018-09 | PRJEB28423 | Synthetic | RNA | 10xpolyA | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | 15xpolyA | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | 30xpolyA | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | 60xpolyA | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | 80xpolyA | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | 100xpolyA | flo-min106 | sqk-rna001 |  |  |  |  |
| eGFP_polyA_DNA | 2019-06 | PRJEB31806 | Synthetic | cDNA | dna_rep1_sqklsk108_flipflop | flo-min106 | sqk-lsk108 |  |  |  |  |
|  |  |  |  |  | dna_rep2_sqklsk109_flipflop | flo-min106 | sqk-lsk109 |  |  |  |  |
| eGFP_polyA_RNA | 2019-06 | PRJEB31806 | Synthetic | RNA | rna_rep1_sqkrna001_plus_rt | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | rna_rep2_sqkrna001_plus_rt | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | rna_rep3_sqkrna002_minus_rt | flo-min106 | sqk-rna002 |  |  |  |  |
| lambda_phage | 2021-03 | PRJNA926802 | lambda phage | DNA | VER5940 | flo-flg001 | sqk-lsk109 |  |  |  |  |
| NA12878 | 2019-06 | PRJEB23027 | Homo sapiens | DNA | FAB42828 | flo-min106 | sqk-lsk108 |  |  |  |  |
|  |  |  |  |  | FAF04090 | flo-min106 | sqk-lsk108 |  |  |  |  |
|  |  |  |  |  | FAF09968 | flo-min106 | sqk-lsk108 |  |  |  |  |
| curlcake | 2019-07 | PRJNA511582 | Synthetic | RNA | m6A-mod-rep1 | flo-min106 | sqk-rna001 |  |  |  | m6A |
|  |  |  |  |  | m6A-mod-rep2 | flo-min106 | sqk-rna001 |  |  |  | m6A |
|  |  |  |  |  | non-mod-rep1 | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | non-mod-rep2 | flo-min106 | sqk-rna001 |  |  |  |  |
| scBY4741_m5C | 2021-06 | PRJNA563591 | Synthetic | RNA | m5C_modified | flo-min106 | sqk-rna001 |  |  |  | m5C |
| scBY4741_hm5C | 2021-06 | PRJNA548268 | Synthetic | RNA | hm5C_modified | flo-min106 | sqk-rna001 |  |  |  | hm5C |
| scBY4741_pU | 2021-02 | PRJNA549001 | Synthetic | RNA | pU_modified | flo-min106 | sqk-rna001 |  |  |  | Ψ |
| hct116 | 2021-04 | PRJEB44348 | Homo sapiens | RNA | HCT-WT-rep1 | flo-min106 | sqk-rna002 |  |  |  | m6A |
|  |  |  |  |  | HCT-WT-rep2 | flo-min106 | sqk-rna002 |  |  |  | m6A |
|  |  |  |  |  | HCT-WT-rep3 | flo-min106 | sqk-rna002 |  |  |  | m6A |
| hek293t_wt | 2021-01 | PRJEB40872 | Homo sapiens | RNA | HEK293T-WT-rep1 | flo-min106 | sqk-rna001 |  |  |  | m6A |
|  |  |  |  |  | HEK293T-WT-rep2 | flo-min106 | sqk-rna002 |  |  |  | m6A |
|  |  |  |  |  | HEK293T-WT-rep3 | flo-min106 | sqk-rna002 |  |  |  | m6A |
| hek293t_ko | 2021-01 | PRJEB40872 | Homo sapiens | RNA | HEK293T-Mettl3-KO-rep1 | flo-min106 | sqk-rna001 |  |  |  |  |
|  |  |  |  |  | HEK293T-Mettl3-KO-rep2 | flo-min106 | sqk-rna002 |  |  |  |  |
|  |  |  |  |  | HEK293T-Mettl3-KO-rep3 | flo-min106 | sqk-rna002 |  |  |  |  |
| mESCs_eligos | 2020-10 | PRJNA497103 | Mus musculus | RNA | mESCs_Mettl3_WT | flo-min106 | sqk-rna002 |  |  |  | m6A |
|  |  |  |  |  | mESCs_Mettl3_KO | flo-min106 | sqk-rna002 |  |  |  |  |
| ecoli_eligos | 2020-08 | PRJNA497103 | Escherichia coli | RNA | IVT_Inosine | flo-min106 | sqk-rna002 |  |  |  | Inosine |
|  |  |  |  |  | IVT_m5C | flo-min106 | sqk-rna002 |  |  |  | m5C |
|  |  |  |  |  | IVT_m6A | flo-min106 | sqk-rna002 |  |  |  | m6A |
|  |  |  |  |  | IVT_normalA | flo-min106 | sqk-rna002 |  |  |  |  |
|  |  |  |  |  | IVT_normalC | flo-min106 | sqk-rna002 |  |  |  |  |
| dinopore_ivt | 2023-01 | SRP363295 | Synthetic | RNA | gBlock_pureI | flo-min106 | sqk-rna001 |  |  |  | Inosine |
|  |  |  |  |  | gBlock_G | flo-min106 | sqk-rna001 |  |  |  |  |
| dinopore_xenopus | 2022-04 | SRP363295 | Xenopus laevis | RNA | rep3_stage1_20200812 | flo-min106 | sqk-rna002 |  |  |  | Inosine |
|  |  |  |  |  | rep3_stage1_20201005 | flo-min106 | sqk-rna002 |  |  |  | Inosine |
|  |  |  |  |  | rep3_stage9_20200812 | flo-min106 | sqk-rna002 |  |  |  | Inosine |
|  |  |  |  |  | rep3_stage9_20201008 | flo-min106 | sqk-rna002 |  |  |  | Inosine |

  Dataset statistics

* Base-called sequences were generated with Guppy 6.0.1.
| Dataset | Type | Raw data size | Sample | # multi_fast5 | # reads | Avg. current signal length | Avg. base sequence length* |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ont_ployA_standard | RNA | 81 GB | 10xpolyA | 24 | 92,428 | 59001.85 | 1207.22 |
|  |  |  | 15xpolyA | 23 | 91,084 | 56518.49 | 1216.28 |
|  |  |  | 30xpolyA | 16 | 63,886 | 54111.54 | 1192.65 |
|  |  |  | 60xpolyA | 28 | 108,314 | 57397.07 | 1172.57 |
|  |  |  | 80xpolyA | 103 | 409,634 | 47166.28 | 859.32 |
|  |  |  | 100xpolyA | 70 | 279,895 | 61938.01 | 1173.39 |
| eGFP_polyA_DNA | cDNA | 43 GB | dna_rep1_sqklsk108_flipflop | 121 | 484,000 | 8956.69 | 763.46 |
|  |  |  | dna_rep2_sqklsk109_flipflop | 71 | 280,428 | 21619.23 | 1667.14 |
| eGFP_polyA_RNA | RNA | 529 GB | rna_rep1_sqkrna001_plus_rt | 231 | 922,826 | 57068.67 | 1126.53 |
|  |  |  | rna_rep2_sqkrna001_plus_rt | 364 | 1,452,042 | 50103.37 | 928.41 |
|  |  |  | rna_rep3_sqkrna002_minus_rt | 149 | 592,571 | 30888.61 | 465.02 |
| lambda_phage | DNA | 19 GB | VER5940 | 114 | 113,514 | 116272.62 | 9561.99 |
| NA12878 | DNA | 68 GB | FAB42828 | 9 | 33,633 | 131148.91 | 6810.35 |
|  |  |  | FAF04090 | 24 | 62,833 | 509826.89 | 17801.22 |
|  |  |  | FAF09968 | 6 | 21,947 | 334920.97 | 53615.01 |
| curlcake | RNA | 584 GB | m6A-mod-rep1 | 34 | 134,374 | 69745.77 | 850.16 |
|  |  |  | m6A-mod-rep2 | 160 | 638,860 | 58341.88 | 835.01 |
|  |  |  | non-mod-rep1 | 17 | 66,736 | 57930.6 | 866.98 |
|  |  |  | non-mod-rep2 | 212 | 846,595 | 61719.51 | 1066.53 |
| scBY4741_m5C | RNA | 37 GB | m5C_modified | 104 | 415,453 | 40792.42 | 539.89 |
| scBY4741_hm5C | RNA | 17 GB | hm5C_modified | 28 | 111,015 | 81528.2 | 1022.88 |
| scBY4741_pU | RNA | 4 GB | pU_modified | 11 | 42,386 | 46652.89 | 475.18 |
| hct116 | RNA | 346 GB | HCT-WT-rep1 | 247 | 987,488 | 66363.12 | 1217.43 |
|  |  |  | HCT-WT-rep2 | 254 | 1,015,893 | 57524.51 | 1023.03 |
|  |  |  | HCT-WT-rep3 | 419 | 1,673,394 | 65628.29 | 1153.23 |
| hek293t_wt | RNA | 224 GB | HEK293T-WT-rep1 | 261 | 1,040,661 | 60169.77 | 939.8 |
|  |  |  | HEK293T-WT-rep2 | 349 | 1,396,000 | 54077.71 | 1077.61 |
|  |  |  | HEK293T-WT-rep3 | 133 | 513,561 | 56785.55 | 1005.06 |
| hek293t_ko | RNA | 356 GB | HEK293T-Mettl3-KO-rep1 | 373 | 1,490,210 | 58140.7 | 952.63 |
|  |  |  | HEK293T-Mettl3-KO-rep2 | 454 | 1,815,589 | 52569.78 | 993.85 |
|  |  |  | HEK293T-Mettl3-KO-rep3 | 421 | 1,677,075 | 50185.96 | 970.32 |
| mESCs_eligos | RNA | 220 GB | mESCs_Mettl3_WT | 791 | 3,163,286 | 33202.35 | 526.23 |
|  |  |  | mESCs_Mettl3_KO | 382 | 1,527,561 | 28350.74 | 437.70 |
| ecoli_eligos | RNA | 214 GB | IVT_Inosine | 203 | 811,953 | 32978.04 | 845.43 |
|  |  |  | IVT_m5C | 144 | 573,674 | 45397.06 | 719.52 |
|  |  |  | IVT_m6A | 371 | 1,482,437 | 41642.13 | 708.29 |
|  |  |  | IVT_normalA | 96 | 383,209 | 33499.75 | 620.83 |
|  |  |  | IVT_normalC | 114 | 452,806 | 44566.76 | 731.75 |
| dinopore_ivt | RNA | 15 GB | gBlock_pureI | 47 | 165,628 | 29869.74 | 450.32 |
|  |  |  | gBlock_G | 43 | 150,405 | 32047.08 | 641.17 |
| dinopore_xenopus | RNA | 399 GB | rep3_stage1_20200812 | 363 | 1,451,289 | 46688.45 | 917.23 |
|  |  |  | rep3_stage1_20201005 | 454 | 1,812,200 | 27213.72 | 532.63 |
|  |  |  | rep3_stage9_20200812 | 391 | 1,560,032 | 44621.79 | 894.37 |
|  |  |  | rep3_stage9_20201008 | 313 | 1,251,130 | 31185.45 | 448.15 |
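
The two average-length columns can be combined into a rough "signal samples per called base" figure for each sample. The sketch below is illustrative only: the names `ROWS`, `samples_per_base`, and `summarize` are hypothetical, and only four rows are copied from the statistics table. Note that dividing one average by another is not the same as averaging per-read ratios, and the raw signal also covers non-base-called portions such as adapters, so treat the result as a coarse indicator.

```python
# A few per-sample averages copied from the statistics table (subset only).
# Fields: (dataset, sample, type, avg_signal_length, avg_base_length)
ROWS = [
    ("ont_ployA_standard", "10xpolyA", "RNA", 59001.85, 1207.22),
    ("eGFP_polyA_DNA", "dna_rep1_sqklsk108_flipflop", "cDNA", 8956.69, 763.46),
    ("lambda_phage", "VER5940", "DNA", 116272.62, 9561.99),
    ("curlcake", "m6A-mod-rep1", "RNA", 69745.77, 850.16),
]


def samples_per_base(avg_signal_length, avg_base_length):
    """Ratio of average raw signal length to average base-called length."""
    return avg_signal_length / avg_base_length


def summarize(rows):
    """Return {sample: samples-per-base ratio rounded to two decimals}."""
    return {
        sample: round(samples_per_base(signal, bases), 2)
        for _dataset, sample, _type, signal, bases in rows
    }


if __name__ == "__main__":
    for sample, ratio in summarize(ROWS).items():
        print(f"{sample}: {ratio} signal samples per base")
```

In this subset the RNA samples show a noticeably higher ratio than the DNA and cDNA samples, which is the kind of quick sanity check the table supports before downloading anything.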

  Raw dataset download

All raw datasets can be downloaded from a separate page, Raw Data Download, which provides the raw fast5 files directly.


  Processed dataset download

All processed datasets can be downloaded from Zenodo.

The dataset is dynamic; the current version is 1.0.0. We continue to add processed datasets produced by additional software.


  Processed dataset structure

The general structure of each dataset is shown in the following picture (the example dataset has two samples, sample_0 and sample_1). However, not all datasets contain all modules. For example, Tailfindr and SegPore cannot process DNA data, so DNA datasets usually lack the 3_tailfindr and 6_segpore modules. Some datasets contain only a few modules, and we are continuing to update them.
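
Because the set of modules differs between datasets, it can help to check what a downloaded dataset actually contains before running downstream tools. The following is a minimal sketch under a stated assumption: that each sample directory (such as sample_0) directly contains module directories named with a numeric prefix (such as 3_tailfindr or 6_segpore). The real nesting may differ from this, and the function names `list_modules` and `missing_modules` are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming pattern for module directories, e.g. "3_tailfindr".
MODULE_DIR = re.compile(r"^(\d+)_(\w+)$")


def list_modules(dataset_dir):
    """Map each sample directory name to its module directories.

    Assumes the layout <dataset_dir>/<sample>/<N>_<module>/, which is an
    illustrative guess and not a confirmed description of the archive.
    """
    modules = {}
    for sample_dir in sorted(Path(dataset_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        found = [
            child.name
            for child in sample_dir.iterdir()
            if child.is_dir() and MODULE_DIR.match(child.name)
        ]
        # Sort by the numeric prefix so "10_x" comes after "3_x".
        modules[sample_dir.name] = sorted(
            found, key=lambda name: int(name.split("_", 1)[0])
        )
    return modules


def missing_modules(dataset_dir, expected):
    """Report, per sample, which of the expected modules are absent."""
    return {
        sample: [name for name in expected if name not in present]
        for sample, present in list_modules(dataset_dir).items()
    }
```

For a DNA dataset, calling `missing_modules(path, ["3_tailfindr", "6_segpore"])` would be expected to list both modules as absent for every sample, consistent with the note above.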