Mapping the Genome of Physarum polycephalum

The world-wide Physarum community was delighted when, in August, 2004, the National Human Genome Research Institute announced that Physarum was one of 18 organisms selected for addition to the sequencing pipeline.

For the NHGRI News Release, Click Here.

Since that time the group working on data acquisition has been very busy. The first stages of their efforts have produced a great deal of information and revealed unexpected complexities. A group publication is nearing completion, and a decision has been reached to assemble a first-stage, high-resolution genetic map; work on the map is expected to begin later this year (2015). Some of the earliest communications from the Genome Coordinating Group were posted here previously (see the material below the double line), but as the complexity of these studies became more obvious, we decided to spare the community until solid results could be presented in a relatively concise format!

In the interim, you can visit the Genome Resources Website for more information. It is available here.



A Physarum Genome Coordinating Group has been formed to facilitate collaboration among interested workers. For a summary of the initial meeting of this Genome Coordinating Group, and an invitation to participate, Click Here.

This site is currently serving as a repository of "progress statements & information requests" from the Group; these appear below, in reverse chronological order. An initial report by Jonatha Gott, published in the 2004 Physarum Newsletter, may be downloaded by clicking here.
*********
November 1, 2006. Jonatha Gott files another update on the overall status of the Physarum Genome Project, asking for feedback, and the names of any others who wish to be included on her Email list. To download this update, as a WORD document, Click Here.
*********
June 26, 2006. Jonatha Gott files another update on the overall status of the Physarum Genome Project, asking for feedback. To download this update, as a WORD document, Click Here.
*********
April 28, 2006. Jonatha Gott files an additional update on the overall status of the Physarum Genome Project, focussing on some cost factors. To download this update, as a WORD document, Click Here.
*********
April 25, 2006. Jonatha Gott files an update on the overall status of the Physarum Genome Project. To download this update, as a WORD document, Click Here. To download the progress report by Gerard Pierron (referred to in the update) as a WORD document, Click Here.
*********
February 1, 2006. Jonatha Gott files an update on the overall status of the Physarum Genome Project. To download this update, as a WORD document, Click Here.
*********
February 1, 2006. Gerard Pierron posts his fourth progress report. To download this fourth report, as a WORD document, Click Here. To download the associated spreadsheet, Click Here.
*********
January 5, 2006. Gerard Pierron posts his third progress report. To download this third report, as a WORD document, Click Here.
*********
December 20, 2005. Ernst Werner reports that he "found Gerard's 'how to' instructions very useful. I successfully used them in the first try, a rare event with these programs! I found his progress reports very interesting!" To download Ernst Werner's report, as a WORD document, Click Here.
*********
December 12, 2005. Gerard Pierron posts his second progress report. To download this second report, as a WORD document, Click Here.
*********
December 1, 2005: Gerard Pierron has begun to work with the genomic data available. He proposes to post a series of progress reports. To download his first progress report, as a WORD document, Click Here.
*********
August 15, 2005 - Message from Jonatha Gott:

I could use some help on the web page. Any takers? Thanks, Jonatha

Message from Sandy Clifton:

I have not forgotten your request for access to the survey sequence data. The person who can do that is on maternity leave, and she is only online periodically. I have contacted her again. I will let you know as soon as she sets up the data on an FTP site for access.

I have another favor to ask of you. We are working on our project web pages. If you will go to http://genome.wustl.edu/ISAgenome.cgi?GENOME=Physarum%20polycephalum you will see that we have some text we are using as a placeholder until we get the info that we really need. We are going to standardize the format so that there are 3 sections: HABITAT, BIOLOGY, and SEQUENCING PLANS. I can handle the sequencing plans (essentially that the survey sequence is complete and the plan is being formulated to submit to the NHGRI), but it would be good if you or some of your colleagues, who really know the organism, would write some text for the habitat and biology sections. We are trying to keep the text to a single web page, so that might be good to keep in mind.

Let me know if there is anyone whom you would like for me to contact regarding this request.
*******************
August 4, 2005 - Message from Jonatha Gott:

This is a second request for information that is to be used in generating the sequence plan for the Physarum genome project. Please take time to respond to this message, as it may make the difference between having a draft vs. a finished Physarum genome sequence.

IF YOU DON'T HAVE TIME TO SEND ME A DETAILED DESCRIPTION IMMEDIATELY, PLEASE AT LEAST LET ME KNOW WHAT YOU DO HAVE SO THAT I CAN SEND THAT INFORMATION TO SANDY CLIFTON - DETAILS CAN BE FILLED IN LATER.

Sandy has asked me to assemble a list of resources available within the Physarum community that would be useful in generating as complete a picture of the Physarum genome as possible. In particular, please let me know if you (or anyone you know) have any of the following:
1. genetic map of Physarum
2. physical map of Physarum genome
3. libraries:
BAC or other libraries with large DNA fragments
genomic libraries
cDNA libraries

4. other resources that might be useful in genome assembly and gene annotation

*** If you do have any of the above, it would be particularly helpful if you briefly described how it was generated (eg. life cycle stage, vector, average insert size, etc.). Please note that the Wash U group may be willing to sequence other available libraries as part of this project, so this could potentially be a good way to assess the quality of your current libraries at no charge! ***

For instance, I have Tim Burland's genomic libraries, one with inserts of ~1kb, the other with 1-5 kb inserts, made by Stratagene in lambda Zap, as well as two of his cDNA libraries (ClonTech "capfinder"), one from prophase, one from S phase. I have never used any of these, and would appreciate hearing if anyone has experience with them.

My intent is to make as complete a list as possible for Sandy, and to make this list available to the entire mailing list. If your reagents are not yet published and you do not wish to "share" them, let me know and that information will remain confidential. Ultimately, I would like to have a section in our database/website that contains this information to facilitate the sharing of valuable resources between labs. Any thoughts on this? Again, please feel free to send this message to anyone that isn't already on the mailing list.
*******************
June 21, 2005 - Jonatha Gott, forwarding a message from Rex Chisholm:

I'm currently writing a letter in support of the dictyBase grant renewal and thought I'd pass on some of Rex Chisholm's thoughts on how the Physarum database might be organized. I think that it will be a great collaboration!

From Rex Chisholm:

We have been thinking a lot about the comparative genomics possibilities. What we are envisioning is first establishing a Physarum database that basically looks like dictyBase but with different colors and logo. We need to think of a name and register the appropriate domain name, probably something like physarumgenome.org or maybe even physarum.org if it is available. The database could be called "Physarum genome database (PGD)" or anything else you guys like.

But, in addition, on each dictyBase gene page and on every physarum gene page we'd like to integrate reciprocal links between orthologs/homologs. Also we are implementing technology that would allow us to show regions of synteny (if they exist) between Dicty and Physarum. We are also toying with the idea of creating something like "AmoebaBase" that could provide a single portal of access to dictyBase, Physarum, other Dictyostelids (we have requested sequence of the related species) and Acanthamoeba if the political issues can be resolved. These are just some starting ideas. Obviously we seek your input as well as that of anyone else who is interested.

Let me know how this sounds. In the meanwhile thanks for agreeing to provide a letter of collaboration. Also please let me know what I can do to help make the argument for finishing the Physarum sequence.
*******************
June 17, 2005 - Message from Jonatha Gott:

After my last, rather long, email, you may not want to hear from me again, but please take time to respond to this message, as it may make the difference between having a draft vs. a finished Physarum genome sequence.

Sandy Clifton has asked me to assemble a list of resources available within the Physarum community that would be useful in generating as complete a picture of the Physarum genome as possible. In particular, please let me know if you (or anyone you know) have any of the following:
1. genetic map of Physarum
2. physical map of Physarum genome
3. libraries:
BAC or other libraries with large DNA fragments
genomic libraries
cDNA libraries

4. other resources that might be useful in genome assembly and gene annotation

*** If you do have any of the above, it would be particularly helpful if you briefly described how it was generated (eg. life cycle stage, vector, average insert size, etc.). Please note that the Wash U group may be willing to sequence other available libraries as part of this project, so this could potentially be a good way to assess the quality of your current libraries at no charge! ***

For instance, I have Tim Burland's genomic libraries, one with inserts of ~1kb, the other with 1-5 kb inserts, made by Stratagene in lambda Zap, as well as two of his cDNA libraries (ClonTech "capfinder"), one from prophase, one from S phase. I have never used any of these, and would appreciate hearing if anyone has experience with them.

My intent is to make as complete a list as possible for Sandy, and to make this list available to the entire mailing list. If your reagents are not yet published and you do not wish to "share" them, let me know and that information will remain confidential. Ultimately, I would like to have a section in our database/website that contains this information to facilitate the sharing of valuable resources between labs. Any thoughts on this? Again, please feel free to send this message to anyone that isn't already on the mailing list.
*******************
June 17, 2005 - Message from Jonatha Gott:

Just spent about an hour on the phone with Sandy Clifton getting a "translation" of information that I sent you earlier (copied below). She answered my questions and the ones that Gerard had sent me, plus we talked a bit about their plans and what happens next. Will try to summarize our conversation below. Please read through this carefully and feel free to send me comments, questions, additional information, etc.

**ALSO, I NEED EVERYONE'S HELP IN ASSEMBLING A RESOURCE LIST, which I will discuss in my next email once I've put things in context.

SUMMARY:
What has been done: The original intent was to sequence 3 x 384 = 1152 clones in both directions, but in their experience, it is hard to assess a genome project with so few reads. They discussed it with NHGRI and got the go-ahead to sequence more. Accordingly, DNA from 13,440 clones (i.e., 35 rather than 3 plates of 384 wells each) was sequenced from both ends. This process is largely automated, so not all of them generated useful sequence (cross-contamination, no DNA in some wells, vector sequence, etc.). Final result after a couple of rounds of trimming and assessment (26880 -> 22367 -> 20780) was 20,780 trimmed traces containing roughly 14 million base pairs, or nearly 10% of the genome. She wasn't sure of the details of the cloning for this particular project, but expected the plasmids to contain roughly 3-5 kb inserts. Each was sequenced from each end, generating 650-700 bp per trace. When they do the "real" sequencing, they will most likely use fosmids with inserts in the 40 kb range, which are more useful for assembling contigs, particularly since the known Tp1 retrotransposons (see below) are ~8.9 kb in length.
GC content: The G+C content of these sequences was ~40%, which is similar to the genome. However, since genes tend to be more GC rich than the intervening sequences and repeat elements, they are still not sure if the sequences they have are biased.
Repeats: Thus far they have seen ~8.84% simple repeats, which is higher than they like. They were not aware of the retrotransposons that had been reported in Physarum (nor was I), which Gerard alerted us to (see his comments below). They will look for these as well and expect the number of repeats to go up upon further analysis.
From Gerard Pierron: It is true that information on repeated elements in Physarum is missing in the form of a specific database. However, nothing is mentioned about rDNA, nor about Normann Hardman's retrotransposon-like sequence Tp1, which is an 8.9 kb element related to LTR retrotransposons and appears to be a significant part of the repetitive DNA of Physarum. The sequence is known (accession number X53558). A second retrotransposon, Tp2, is also known (X52770). I wonder whether all the sequences were blasted, or whether these retrotransposons correspond to the 12 reported LTR elements (ERV_classI) that are mentioned for a total of 1087 bp? (Sorry Gerard, I forgot to ask this question specifically!)
Contamination screen: Looks quite good. Contaminating sequences included 4 general cloning vectors, 4 of their cloning vector, 4 E.coli sequences, 9 ? (she thinks GBBCT are chloroplast sequences, but will check on that), and 28 mitochondrial sequences - not bad with 20,780 traces! Many thanks to Gerard Pierron for providing such high quality DNA (he has more in the freezer ready to send when they are ready for it).
Gene discovery: They haven't looked for individual genes yet, but have received EST sequences and will be doing that next. Thanks to Mike Gray and Gernot Gloeckner/Wolfgang Marwan for providing unpublished data for this analysis. (NB: Expect both sets of data to be published reasonably soon - still being analyzed)
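
The read counts quoted in this summary can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function and variable names are illustrative only, and the 14,000,000 bp total is the rounded "roughly 14 million base pairs" figure from the text, so the mean trace length is an estimate rather than a measured value.

```python
# Sanity check of the survey-sequencing numbers quoted in the summary.
# All names here are illustrative; the figures come from the message text.

WELLS_PER_PLATE = 384   # one clone per well
ENDS_PER_CLONE = 2      # each clone is sequenced from both ends


def clones(plates: int) -> int:
    """Number of clones picked for a given number of 384-well plates."""
    return plates * WELLS_PER_PLATE


def attempted_traces(plates: int) -> int:
    """Number of sequencing traces attempted (two reads per clone)."""
    return clones(plates) * ENDS_PER_CLONE


planned_clones = clones(3)            # original plan: 3 plates
actual_clones = clones(35)            # what was actually sequenced
total_traces = attempted_traces(35)   # start of 26880 -> 22367 -> 20780

passed_traces = 20_780                # traces surviving trimming and screening
approx_total_bp = 14_000_000          # "roughly 14 million base pairs" (rounded)

pass_rate = passed_traces / total_traces
mean_trace_bp = approx_total_bp / passed_traces

print(f"planned clones:    {planned_clones}")
print(f"actual clones:     {actual_clones}")
print(f"attempted traces:  {total_traces}")
print(f"pass rate:         {pass_rate:.1%}")
print(f"mean trace length: {mean_trace_bp:.0f} bp (estimate)")
```

The estimated mean trace length falls inside the 650-700 bp range quoted above, which suggests the figures in the summary are mutually consistent.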

Next steps:
1. compare with EST data
2. they prepare a sequencing plan to present to NHGRI in consultation with us
3. NHGRI decides how much they are willing to spend on the project and either approves or modifies sequencing plan (no set time frame - this may take a few months)
4. cloning, sequencing, assembly, and auto-finishing
5. hopefully "finishing" (i.e. actual human involvement)

Other items of note:
1. They would like to do a finished sequence. At a minimum can expect a draft sequence with 6x coverage, but think it is likely that we will end up with better than that.
2. Much of the process is automated, with computers doing most of the data analysis, including making the calls regarding sequence quality, gap filling, primer design, etc. If the data go through two rounds of "auto-finishing", usually end up with very large contigs of high quality sequence. Think it is reasonably likely that the Physarum genome will get this far. The next (and last) level of finishing is the expensive part - paying someone to sit in front of the computer to assess the sequence ~1000 bp at a time. The chances of this are considerably lower, but that's what we're trying to work towards.
3. Sandy was very glad to hear that we plan to partner with Dictybase. She felt that having a data sharing plan would make their case for a finished sequence stronger.
4. Expected error rate in the range of 1 mistake per 100,000 bases, potentially substantially better.
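
To give a feel for what "6x coverage" in item 1 implies, average coverage can be estimated as (number of reads x read length) / genome size. The sketch below is only a rough illustration: the genome size of 140 Mbp is an assumption back-calculated from the statement that about 14 million bp is "nearly 10% of the genome" (so the real genome is presumably somewhat larger), and the 675 bp read length is simply the midpoint of the 650-700 bp range quoted earlier. Neither number, nor any of the names used, comes from the sequencing center.

```python
import math

# ASSUMPTIONS (not from the sequencing center):
GENOME_BP_ASSUMED = 140_000_000   # back-calculated from "14 Mbp is nearly 10%"
READ_BP_ASSUMED = 675             # midpoint of the quoted 650-700 bp per trace

SURVEY_BP = 14_000_000            # "roughly 14 million base pairs" (rounded)


def average_coverage(total_read_bp: float, genome_bp: float) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return total_read_bp / genome_bp


def reads_needed(coverage: float, genome_bp: int, read_bp: int) -> int:
    """Usable reads required to reach the given average fold coverage."""
    return math.ceil(coverage * genome_bp / read_bp)


survey_coverage = average_coverage(SURVEY_BP, GENOME_BP_ASSUMED)
reads_for_6x = reads_needed(6, GENOME_BP_ASSUMED, READ_BP_ASSUMED)

print(f"survey coverage: about {survey_coverage:.2f}x")
print(f"usable reads for 6x: about {reads_for_6x:,}")
```

This counts only usable reads and ignores failed wells, trimming losses, and uneven cloning of A/T-rich regions, so the number of reads actually attempted would need to be higher.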

Data availability:
1. Everything they do is public and all data will be made available promptly.
The individual sequence traces are added to an archive that I believe should be available to everyone through their website. Note that some of the traces that were ultimately rejected may be included in this set (eg. all 22367 traces that made the first cut may be in the archive). I'm not sure when or how to access these raw data, but if anyone is interested, I'd be happy to try to find out. You might check out their homepage first: http://genome.wustl.edu/
2. Assemblies can be posted using fasta format on their ftp site so that collaborators can use the data shortly after they are generated. I envision ALL of us having access to that site. Please check the mailing list and let me know if anyone is missing that should be on it; I certainly don't want to exclude anyone that would be interested. MARK ADELMAN, would you be willing to check my mailing list against your master Physarum mailing list and let me know who is missing?
3. The final assembly is not made available until after publication. However, she warned that lately bioinformaticists have been doing their own analysis of the data in the individual traces and publishing first. They would prefer to be the ones that publish their own data and would like to work with experts in the field to write this up in a timely manner. Please let me know if you'd like to be included in such an effort.

Finally, as I will address in a second email, the chances of getting a finished sequence will be enhanced if there are other resources freely available to them, such as genetic and/or physical maps of the genome, other libraries of all types (both genomic and cDNA). BACs would be particularly helpful. She has asked me to assemble a list of what is available, who has it, etc.
Well, enough for now. Feedback welcome, as always.
*******************
June 15, 2005 - Message from Jonatha Gott:

Here's what I have from Sandy Clifton so far. I don't know what this means for the genome project as yet, but thought I'd send it on in case any of you were interested in seeing the raw numbers thus far. I hope to talk with Sandy later this week, and will urge them to look at the EST data as soon as possible. I'll write again once I've talked with her.

Want to keep the entire process open throughout the project; please feel free to contact me with questions at any time and I'll try to get answers.

From Sandy Clifton:

I finally ran down the analysis results for the ~20K passed survey sequence reads that we performed. Note that the repeat content is 8.84 %, but this may be low, reflecting the difficulty in cloning A/T rich regions (~60% A/T, so far). We might have to use special methods to try to capture more of the A/T rich areas. Also note that this repeat count reflects only simple repeats and areas of low complexity, since we do not have a database of repeats specific for Physarum. We have not had a chance to look at the ESTs yet. We had about 28 mitochondrial reads, as well.


And here are the results for Physarum polycephalum:
Physarum polycephalum (slime mold):
-26880 traces- (20780 passed, trimmed traces were screened) POAA-aaa01001
1) GC content- (based on phrap assembly contigs & singlets) Physarum_polycephalum 40%
2) Repeat content- (based on phrap assembly contigs & singlets...then used RepeatMasker -w -e)
NOTE: since we don't have specific repeat libraries, these stats represent simple repeats and regions of low complexity
Physarum_polycephalum 8.84 % masked
==================================================
file name: contigs_and_singlets.fas
sequences: 22367
total length: 14035327 bp (14023066 bp excl N-runs)
GC level: 40.15 %
bases masked: 1239673 bp ( 8.84 %)
==================================================
                      number of      length    percentage
                      elements*    occupied   of sequence
--------------------------------------------------
SINEs:                       13      1557 bp      0.01 %
      ALUs                    0         0 bp      0.00 %
      MIRs                   13      1557 bp      0.01 %

LINEs:                       51      9078 bp      0.06 %
      LINE1                  41      8222 bp      0.06 %
      LINE2                   6       659 bp      0.00 %
      L3/CR1                  2       102 bp      0.00 %

LTR elements:                12      1087 bp      0.01 %
      MaLRs                   0         0 bp      0.00 %
      ERVL                    0         0 bp      0.00 %
      ERV_classI             12      1087 bp      0.01 %
      ERV_classII             0         0 bp      0.00 %

DNA elements:                 3       207 bp      0.00 %
      MER1_type               2       164 bp      0.00 %
      MER2_type               1        43 bp      0.00 %

Unclassified:                 0         0 bp      0.00 %

Total interspersed repeats:         11929 bp      0.09 %

Small RNA:                   13       906 bp      0.01 %

Satellites:                   1        66 bp      0.00 %
Simple repeats:            6677    447325 bp      3.19 %
Low complexity:           12548    780477 bp      5.57 %
==================================================
3) Contamination screen- (trace flagged as having contaminant if alignment of 200bp at 90% id was found)
Physarum polycephalum (slime mold):
20780 passed, trimmed traces were screened
4 UNIVEC
4 POTW13
4 ECOLI
9 GBBCT
28 MITO (Physarum polycephalum complete mito genome)
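
The percentages in the report above can be reproduced from the raw base counts. The following Python sketch is offered only as a cross-check; the names are illustrative. One inference is worth flagging: the reported 8.84 % matches the masked bases divided by the length excluding N-runs, not the full total length. That is deduced from the arithmetic here and is not stated in the message.

```python
# Cross-check of the RepeatMasker summary and contamination screen above.
# Counts are copied from the report; all names are illustrative.

TOTAL_BP = 14_035_327          # "total length"
TOTAL_BP_EXCL_N = 14_023_066   # "excl N-runs"
MASKED_BP = 1_239_673          # "bases masked"
SIMPLE_REPEAT_BP = 447_325
LOW_COMPLEXITY_BP = 780_477

SCREENED_TRACES = 20_780
CONTAMINANT_HITS = {           # label -> number of flagged traces
    "UNIVEC": 4,
    "POTW13": 4,
    "ECOLI": 4,
    "GBBCT": 9,
    "MITO": 28,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


masked_pct = percent(MASKED_BP, TOTAL_BP_EXCL_N)          # matches the report
masked_pct_full = percent(MASKED_BP, TOTAL_BP)            # slightly lower
simple_pct = percent(SIMPLE_REPEAT_BP, TOTAL_BP_EXCL_N)
low_complexity_pct = percent(LOW_COMPLEXITY_BP, TOTAL_BP_EXCL_N)

flagged = sum(CONTAMINANT_HITS.values())
flagged_pct = percent(flagged, SCREENED_TRACES)

print(f"masked (excl N-runs): {masked_pct:.2f} %")
print(f"masked (full length): {masked_pct_full:.2f} %")
print(f"simple repeats:       {simple_pct:.2f} %")
print(f"low complexity:       {low_complexity_pct:.2f} %")
print(f"flagged traces:       {flagged} of {SCREENED_TRACES} ({flagged_pct:.2f} %)")
```

Simple repeats and low-complexity regions together account for nearly all of the masked bases, which is consistent with the note that no Physarum-specific repeat library was available.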

*********
December, 2004. Jonatha Gott contributed an initial status report in the Physarum Newsletter, issue 36. To download Jonatha's report, as a WORD document, Click Here.

