If you wonder why clustering with cd-hit of a recent nr database from NCBI takes ages, here’s an answer:
>gi|10955428|ref|NP_053140.1| hypothetical protein pB171_078 [Escherichia coli]gi|16082681|ref|NP_395228.1| transposase/IS protein [Yersinia pestis CO92]gi|16082847|ref|NP_395401.1| transposase/IS protein [Yersinia pestis CO92]gi|16120383|ref|NP_403696.1| transposase/IS protein [Yersinia pestis CO92]gi|16120444|ref|NP_4 03757.1| transposase/IS protein [Yersinia pestis CO92]gi|16120514|ref|NP_403827.1| transposase/IS protein [ Yersinia pestis CO92]gi|16120586|ref|NP_403899.1| transposase/IS protein [Yersinia pestis CO92]gi|16120719| ref|NP_404032.1| transposase/IS protein [Yersinia pestis CO92]gi|16120857|ref|NP_404170.1| transposase/IS p rotein [Yersinia pestis CO92]gi|16120894|ref|NP_404207.1| transposase/IS protein [Yersinia pestis CO92]gi|1 6120962|ref|NP_404275.1| transposase/IS protein [Yersinia pestis CO92]gi|16121092|ref|NP_404405.1| transpos ase/IS protein [Yersinia pestis CO92]gi|16121136|ref|NP_404449.1| transposase/IS protein [Yersinia pestis C O92]gi|16121228|ref|NP_404541.1| transposase/IS protein [Yersinia pestis CO92]gi|16121314|ref|NP_404627.1| transposase/IS protein [Yersinia pestis CO92]gi|16121385|ref|NP_404698.1| transposase/IS protein [Yersinia pestis CO92]gi|16121430|ref|NP_404743.1| transposase/IS protein [Yersinia pestis CO92]gi|16121620|ref|NP_40 4933.1| transposase/IS protein [Yersinia pestis CO92]gi|16121706|ref|NP_405019.1| transposase/IS protein [Y ersinia pestis CO92]gi|16121792|ref|NP_405105.1| transposase/IS protein [Yersinia pestis CO92]gi|16121890|r ef|NP_405203.1| transposase/IS protein [Yersinia pestis CO92]gi|16121951|ref|NP_405264.1| transposase/IS pr otein [Yersinia pestis CO92]gi|16121988|ref|NP_405301.1| transposase/IS protein [Yersinia pestis CO92]gi|16 122008|ref|NP_405321.1| transposase/IS protein [Yersinia pestis CO92]gi|16122148|ref|NP_405461.1| transposa se/IS protein [Yersinia pestis CO92]gi|16122266|ref|NP_405579.1| transposase/IS protein [Yersinia pestis CO 92]gi|16122324|ref|NP_405637.1| transposase/IS protein [Yersinia pestis CO92]gi|16122408|ref|NP_405721.1| t ransposase/IS protein [Yersinia pestis CO92]gi|16122588|ref|NP_405901.1| transposase/IS protein [Yersinia p estis CO92]gi|16122620|ref|NP_405933.1| transposase/IS protein [Yersinia pestis CO92]gi|16122738|ref|NP_406 051.1| transposase/IS protein [Yersinia pestis CO92]gi|16122852|ref|NP_406165.1| transposase/IS protein [Ye rsinia pestis CO92]gi|16122926|ref|NP_406239.1| transposase/IS protein [Yersinia pestis CO92]gi|16123007|re f|NP_406320.1| transposase/IS protein [Yersinia pestis CO92]gi|16123118|ref|NP_406431.1| transposase/IS pro tein [Yersinia pestis CO92]gi|16123368|ref|NP_406681.1| transposase/IS protein [Yersinia pestis CO92]gi|161 23410|ref|NP_406723.1| transposase/IS protein [Yersinia pestis CO92]gi|16123439|ref|NP_406752.1| transposas e/IS protein [Yersinia pestis CO92]gi|16123584|ref|NP_406897.1| transposase/IS protein [Yersinia pestis CO9 2]gi|16123688|ref|NP_407001.1| transposase/IS protein [Yersinia pestis CO92]gi|16123734|ref|NP_407047.1| tr ansposase/IS protein [Yersinia pestis CO92]gi|16123839|ref|NP_407152.1| transposase/IS protein [Yersinia pe stis CO92]gi|16123892|ref|NP_407205.1| transposase/IS protein [Yersinia pestis CO92]gi|16123908|ref|NP_4072 21.1| transposase/IS protein [Yersinia pestis CO92]gi|16124133|ref|NP_407446.1| transposase/IS protein [Yer sinia pestis CO92]gi|22123963|ref|NP_667386.1| transposase/IS protein [Yersinia pestis KIM]gi|22124031|ref| NP_667454.1| transposase/IS protein [Yersinia pestis KIM]gi|22124203|ref|NP_667626.1| transposase/IS protei n [Yersinia pestis KIM]gi|22124372|ref|NP_667795.1| transposase/IS protein [Yersinia pestis KIM]gi|22124391 |ref|NP_667814.1| transposase/IS protein [Yersinia pestis KIM]gi|22124420|ref|NP_667843.1| transposase/IS p rotein [Yersinia pestis KIM]gi|22124556|ref|NP_667979.1| transposase/IS protein [Yersinia pestis KIM]gi|221 24665|ref|NP_668088.1| transposase/IS protein [Yersinia pestis KIM]gi|22124814|ref|NP_668237.1| transposase /IS protein [Yersinia pestis KIM]gi|22124844|ref|NP_668267.1| transposase/IS protein [Yersinia pestis KIM]g i|22124913|ref|NP_668336.1| transposase/IS protein [Yersinia pestis KIM]gi|22125025|ref|NP_668448.1| transp osase/IS protein [Yersinia pestis KIM]gi|22125118|ref|NP_668541.1| transposase/IS protein [Yersinia pestis KIM]gi|22125219|ref|NP_668642.1| transposase/IS protein [Yersinia pestis KIM]gi|22125447|ref|NP_668870.1| t ransposase/IS protein [Yersinia pestis KIM]gi|22125565|ref|NP_668988.1| transposase/IS protein [Yersinia pe stis KIM]gi|22125833|ref|NP_669256.1| transposase/IS protein [Yersinia pestis KIM]gi|22125913|ref|NP_669336 .1| transposase/IS protein [Yersinia pestis KIM]gi|22126032|ref|NP_669455.1| transposase/IS protein [Yersin ia pestis KIM]gi|22126111|ref|NP_669534.1| transposase/IS protein [Yersinia pestis KIM]gi|22126227|ref|NP_6 69650.1| transposase/IS protein [Yersinia pestis KIM]gi|22126294|ref|NP_669717.1| transposase/IS protein [Y ersinia pestis KIM]gi|22126458|ref|NP_669881.1| transposase/IS protein [Yersinia pestis KIM]gi|22126621|ref |NP_670044.1| transposase/IS protein [Yersinia pestis KIM]gi|22126672|ref|NP_670095.1| transposase/IS prote in [Yersinia pestis KIM]gi|22126967|ref|NP_670390.1| transposase/IS protein [Yersinia pestis KIM]gi|2212702 6|ref|NP_670449.1| transposase/IS protein [Yersinia pestis KIM]gi|22127088|ref|NP_670511.1| transposase/IS protein [Yersinia pestis KIM]gi|22127284|ref|NP_670707.1| transposase/IS protein [Yersinia pestis KIM]gi|22 127489|ref|NP_670912.1| transposase/IS protein [Yersinia pestis KIM]gi|22127607|ref|NP_671030.1| transposas e/IS protein [Yersinia pestis KIM]gi|22127670|ref|NP_671093.1| transposase/IS protein [Yersinia pestis KIM] gi|22127690|ref|NP_671113.1| transposase/IS protein [Yersinia pestis KIM]gi|22127900|ref|NP_671323.1| trans posase/IS protein [Yersinia pestis KIM]gi|31795384|ref|NP_857837.1| transposase/IS protein [Yersinia pestis KIM]gi|31795462|ref|NP_857912.1| transposase/IS protein [Yersinia pestis KIM]gi|32470047|ref|NP_862989.1| putative ATP-binding protein [Escherichia coli]gi|45439896|ref|NP_991435.1| transposase/IS protein [Yersini a pestis biovar Microtus str. 91001]gi|45439948|ref|NP_991487.1| transposase/IS protein [Yersinia pestis bi ovar Microtus str. 91001]gi|45440109|ref|NP_991648.1| transposase/IS protein [Yersinia pestis biovar Microt us str. 91001]gi|45440257|ref|NP_991796.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 910 01]gi|45440297|ref|NP_991836.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]gi|45440 401|ref|NP_991940.1| transposase/IS protein [Yersinia pestis biovar Microtus str. 91001]
But to tell the honest true, this is not a problem - this is less than 10% of only one of many other problems. This particular protein (gi number: 10955428) has over three hundred other gi numbers in its header in non-redundant database from NCBI, which apparently made cd-hit stand still in amusement of such a lengthy description for weeks. Quick fix in Perl, and now the clustering is going to be finished within few hours, as it should.

A lot of programs choke on excessively long FASTA headers. Readseq is one example that comes to mind. I routinely pass FASTA-formatted database sequences through a ‘cut’ filter before passing them on to sequence analysis programs
I must admit it happened to me first time. Last time I did some clustering (maybe a year ago) I didn’t notice any issues but on the other hand new things appear in the GenBank much faster than before.