If you wonder why clustering a recent nr database from NCBI with cd-hit takes ages, here’s an answer:
>gi|10955428|ref|NP_053140.1| hypothetical protein pB171_078 [Escherichia coli]gi|16082681|ref|NP_395228.1| transposase/IS protein [Yersinia pestis CO92]gi|16082847|ref|NP_395401.1| transposase/IS protein [Yersinia pestis CO92]gi|16120383|ref|NP_403696.1| transposase/IS protein [Yersinia pestis CO92]gi|16120444|ref|NP_4 03757.1| transposase/IS protein [Yersinia pestis CO92]gi|16120514|ref|NP_403827.1| transposase/IS protein [ Yersinia pestis CO92]gi|16120586|ref|NP_403899.1| transposase/IS protein [Yersinia pestis CO92]gi|16120719| ref|NP_404032.1| transposase/IS protein [Yersinia pestis CO92]gi|16120857|ref|NP_404170.1| transposase/IS p rotein [Yersinia pestis CO92]gi|16120894|ref|NP_404207.1| transposase/IS protein [Yersinia pestis CO92]gi|1 6120962|ref|NP_404275.1| transposase/IS protein [Yersinia pestis CO92]gi|16121092|ref|NP_404405.1| transpos ase/IS protein [Yersinia pestis CO92]gi|16121136|ref|NP_404449.1| transposase/IS protein [Yersinia pestis C O92]gi|16121228|ref|NP_404541.1| transposase/IS protein [Yersinia pestis CO92]gi|16121314|ref|NP_404627.1| transposase/IS protein [Yersinia pestis CO92]gi|16121385|ref|NP_404698.1| transposase/IS protein [Yersinia pestis CO92]gi|16121430|ref|NP_404743.1| transposase/IS protein [Yersinia pestis CO92]gi|16121620|ref|NP_40 4933.1| transposase/IS protein [Yersinia pestis CO92]gi|16121706|ref|NP_405019.1| transposase/IS protein [Y ersinia pestis CO92]gi|16121792|ref|NP_405105.1| transposase/IS protein [Yersinia pestis CO92]gi|16121890|r ef|NP_405203.1| transposase/IS protein [Yersinia pestis CO92]gi|16121951|ref|NP_405264.1| transposase/IS pr otein [Yersinia pestis CO92]gi|16121988|ref|NP_405301.1| transposase/IS protein [Yersinia pestis CO92]gi|16 122008|ref|NP_405321.1| transposase/IS protein [Yersinia pestis CO92]gi|16122148|ref|NP_405461.1| transposa se/IS protein [Yersinia pestis CO92]gi|16122266|ref|NP_405579.1| transposase/IS protein [Yersinia pestis CO 92]gi|16122324|ref|NP_405637.1| transposase/IS protein 
[Yersinia pestis CO92]gi|16122408|ref|NP_405721.1| t ransposase/IS protein [Yersinia pestis CO92]gi|16122588|ref|NP_405901.1| transposase/IS protein [Yersinia p estis CO92]gi|16122620|ref|NP_405933.1| transposase/IS protein [Yersinia pestis CO92]gi|16122738|ref|NP_406 051.1| transposase/IS protein [Yersinia pestis CO92]gi|16122852|ref|NP_406165.1| transposase/IS protein [Ye rsinia pestis CO92]gi|16122926|ref|NP_406239.1| transposase/IS protein [Yersinia pestis CO92]gi|16123007|re f|NP_406320.1| transposase/IS protein [Yersinia pestis CO92]gi|16123118|ref|NP_406431.1| transposase/IS pro tein [Yersinia pestis CO92]gi|16123368|ref|NP_406681.1| transposase/IS protein [Yersinia pestis CO92]gi|161 23410|ref|NP_406723.1| transposase/IS protein [Yersinia pestis CO92]gi|16123439|ref|NP_406752.1| transposas e/IS protein [Yersinia pestis CO92]gi|16123584|ref|NP_406897.1| transposase/IS protein [Yersinia pestis CO9 2]gi|16123688|ref|NP_407001.1| transposase/IS protein [Yersinia pestis CO92]gi|16123734|ref|NP_407047.1| tr ansposase/IS protein [Yersinia pestis CO92]gi|16123839|ref|NP_407152.1| transposase/IS protein [Yersinia pe stis CO92]gi|16123892|ref|NP_407205.1| transposase/IS protein [Yersinia pestis CO92]gi|16123908|ref|NP_4072 21.1| transposase/IS protein [Yersinia pestis CO92]gi|16124133|ref|NP_407446.1| transposase/IS protein [Yer sinia pestis CO92]gi|22123963|ref|NP_667386.1| transposase/IS protein [Yersinia pestis KIM]gi|22124031|ref| NP_667454.1| transposase/IS protein [Yersinia pestis KIM]gi|22124203|ref|NP_667626.1| transposase/IS protei n [Yersinia pestis KIM]gi|22124372|ref|NP_667795.1| transposase/IS protein [Yersinia pestis KIM]gi|22124391 |ref|NP_667814.1| transposase/IS protein [Yersinia pestis KIM]gi|22124420|ref|NP_667843.1| transposase/IS p rotein [Yersinia pestis KIM]gi|22124556|ref|NP_667979.1| transposase/IS protein [Yersinia pestis KIM]gi|221 24665|ref|NP_668088.1| transposase/IS protein [Yersinia pestis KIM]gi|22124814|ref|NP_668237.1| transposase 
/IS protein [Yersinia pestis KIM]gi|22124844|ref|NP_668267.1| transposase/IS protein [Yersinia pestis KIM]g i|22124913|ref|NP_668336.1| transposase/IS protein [Yersinia pestis KIM]gi|22125025|ref|NP_668448.1| transp osase/IS protein [Yersinia pestis KIM]gi|22125118|ref|NP_668541.1| transposase/IS protein [Yersinia pestis KIM]gi|22125219|ref|NP_668642.1| transposase/IS protein [Yersinia pestis KIM]gi|22125447|ref|NP_668870.1| t ransposase/IS protein [Yersinia pestis KIM]gi|22125565|ref|NP_668988.1| transposase/IS protein [Yersinia pe stis KIM]gi|22125833|ref|NP_669256.1| transposase/IS protein [Yersinia pestis KIM]gi|22125913|ref|NP_669336 .1| transposase/IS protein [Yersinia pestis KIM]gi|22126032|ref|NP_669455.1| transposase/IS protein [Yersin ia pestis KIM]gi|22126111|ref|NP_669534.1| transposase/IS protein [Yersinia pestis KIM]gi|22126227|ref|NP_6 69650.1| transposase/IS protein [Yersinia pestis KIM]gi|22126294|ref|NP_669717.1| transposase/IS protein [Y ersinia pestis KIM]gi|22126458|ref|NP_669881.1| transposase/IS protein [Yersinia pestis KIM]gi|22126621|ref |NP_670044.1| transposase/IS protein [Yersinia pestis KIM]gi|22126672|ref|NP_670095.1| transposase/IS prote in [Yersinia pestis KIM]gi|22126967|ref|NP_670390.1| transposase/IS protein [Yersinia pestis KIM]gi|2212702 6|ref|NP_670449.1| transposase/IS protein [Yersinia pestis KIM]gi|22127088|ref|NP_670511.1| transposase/IS protein [Yersinia pestis KIM]gi|22127284|ref|NP_670707.1| transposase/IS protein [Yersinia pestis KIM]gi|22 127489|ref|NP_670912.1| transposase/IS protein [Yersinia pestis KIM]gi|22127607|ref|NP_671030.1| transposas e/IS protein [Yersinia pestis KIM]gi|22127670|ref|NP_671093.1| transposase/IS protein [Yersinia pestis KIM] gi|22127690|ref|NP_671113.1| transposase/IS protein [Yersinia pestis KIM]gi|22127900|ref|NP_671323.1| trans posase/IS protein [Yersinia pestis KIM]gi|31795384|ref|NP_857837.1| transposase/IS protein [Yersinia pestis KIM]gi|31795462|ref|NP_857912.1| transposase/IS protein 
[Yersinia pestis KIM]gi|32470047|ref|NP_862989.1| putative ATP-binding protein [Escherichia coli]gi|45439896|ref|NP_991435.1| transposase/IS protein [Yersini a pestis biovar Microtus str. 91001]gi|45439948|ref|NP_991487.1| transposase/IS protein [Yersinia pestis bi ovar Microtus str. 91001]gi|45440109|ref|NP_991648.1| transposase/IS protein [Yersinia pestis biovar Microt us str. 91001]gi|45440257|ref|NP_991796.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 910 01]gi|45440297|ref|NP_991836.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]gi|45440 401|ref|NP_991940.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]
But to tell the honest truth, this is not the whole problem: what you see above is less than 10% of only one of many such cases. This particular protein (gi number: 10955428) has over three hundred other gi numbers in its header in the non-redundant database from NCBI, which apparently made cd-hit stand still for weeks in amusement at such a lengthy description. A quick fix in Perl, and now the clustering is going to finish within a few hours, as it should.
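The Perl fix itself is not shown here, so the following is only a minimal Python sketch of the same idea: shorten every header line before the file reaches cd-hit. It assumes that merged entries in the nr FASTA file are separated by a Ctrl-A character (`\x01`), and the function names and the 200-character cap are hypothetical choices for illustration.

```python
MAX_HEADER = 200  # hypothetical cap on header length, not a cd-hit requirement


def shorten_header(line, max_len=MAX_HEADER):
    """Return the first entry of a merged header, cut to max_len characters."""
    header = line.rstrip("\n")
    # Assumption: identical sequences have their deflines joined with Ctrl-A.
    header = header.split("\x01", 1)[0]
    return header[:max_len]


def shorten_fasta(lines, max_len=MAX_HEADER):
    """Yield FASTA lines with header lines shortened; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield shorten_header(line, max_len) + "\n"
        else:
            yield line
```

To filter a file, the generator can be fed an open file handle and its output written to a new file, for example with `out.writelines(shorten_fasta(inp))`. Only the first gi number survives in each header, so keep the original file if the other identifiers are needed later.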
Kay at Suicyte
May 25, 2008 at 20:36
A lot of programs choke on excessively long FASTA headers. Readseq is one example that comes to mind. I routinely pass FASTA-formatted database sequences through a ‘cut’ filter before passing them on to sequence analysis programs.
Pawel Szczesny
May 28, 2008 at 16:06
I must admit this is the first time it has happened to me. The last time I did some clustering (maybe a year ago) I didn’t notice any issues, but on the other hand new things appear in GenBank much faster than before.