[SEEK-Taxon] some concerns about the mammal data uploaded by Trevor

xianhual@email.unc.edu xianhual at email.unc.edu
Tue Mar 1 12:21:24 PST 2005


Hello:

I have developed an algorithm that automatically cleans up the mammal data
in the following steps:

step 1: Clean up publications: parse out page numbers and remove duplicates.
step 2: Update the original concepts in 'MSWOriginalTC.xml' using information
from step 1.
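
For illustration, step 1 could be sketched roughly as below. This is not the
actual algorithm, only a minimal sketch under assumptions: the function names
(split_citation, clean_publications) and the regular expression are
hypothetical, the page part is assumed to be whatever follows the last colon
(optionally followed by a bracketed year), and the <Publication> elements are
assumed to sit under a single root element.

```python
import re
import xml.etree.ElementTree as ET

# Assumed citation shape: "...ser. 8, 10:396." or "...7(4):92 [1872]."
# i.e. a trailing ":<pages>", an optional bracketed year, an optional period.
PAGE_RE = re.compile(
    r":\s*(?P<pages>[^:\[\]]+?)\s*(?P<year>\[\d{4}\])?\s*\.?\s*$"
)

def split_citation(text):
    """Return (publication, pages); pages is None if no page part is found."""
    text = " ".join(text.split())  # collapse line breaks inside the citation
    m = PAGE_RE.search(text)
    if not m:
        return text.rstrip("."), None
    base = text[:m.start()]
    if m.group("year"):
        base += " " + m.group("year")  # keep the year with the publication
    return base, m.group("pages")

def clean_publications(xml_text):
    """Merge publications that differ only by page, and exact duplicates.

    Returns (canonical, remap):
      canonical maps publication text -> the id that is kept,
      remap maps every original id -> (kept id, pages).
    """
    root = ET.fromstring(xml_text)
    canonical, remap = {}, {}
    for pub in root.iter("Publication"):
        base, pages = split_citation(pub.findtext("PublicationSimple", ""))
        kept = canonical.setdefault(base, pub.get("id"))
        remap[pub.get("id")] = (kept, pages)
    return canonical, remap
```

With the records quoted further down in this thread, MSW_PUB5473, MSW_PUB5474
and MSW_PUB5432 would collapse into one publication with pages 396, 397 and
399, and MSW_PUB5480 would be mapped onto MSW_PUB5479. A citation whose only
colon is not a page separator would be split wrongly by this sketch, so the
output would still need checking.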


Two new files have been created.

1. 'MSWOriginalTC_new.xml' as the cleaned-up version of 'MSWOriginalTC.xml'
2. 'MSWvouchersandPublications_new.xml' as the cleaned-up version of
'MSWvouchersandPublications.xml'

Please find the two new files attached and check them for errors.
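
As a rough guide to what to check in the concepts file, updating the concepts
amounts to re-pointing each publication reference at the id that was kept and
carrying the page number along. The sketch below is only an illustration, not
the code that produced the attached files: the function name remap_references
is hypothetical, the mapping remap (original publication id -> (kept id,
pages)) is assumed to exist already, and the attribute name "ref" is an
assumption about how concepts point at publications.

```python
import xml.etree.ElementTree as ET

def remap_references(xml_text, remap, attr="ref"):
    """Point every publication reference at the kept publication id.

    remap maps an original publication id -> (kept id, pages).
    The attribute name (default "ref") is an assumption about the schema.
    Returns the rewritten XML as a string, plus a list of (kept id, pages)
    pairs in document order for the caller to store as microreferences.
    """
    root = ET.fromstring(xml_text)
    pages_found = []
    for elem in root.iter():
        old = elem.get(attr)
        if old in remap:
            kept, pages = remap[old]
            elem.set(attr, kept)  # re-point at the surviving publication
            if pages is not None:
                pages_found.append((kept, pages))
    return ET.tostring(root, encoding="unicode"), pages_found
```

Where the page number should be written in the concept record depends on the
schema, so the sketch only returns it rather than guessing an element name.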


Xianhua


Quoting xianhual at email.unc.edu:

> Jessie,
>
> Yes, I'd like to spend some time cleaning up these duplications and parsing
> the page numbers out from publications. I will try to get to it as soon as
> possible.
>
> Xianhua
>
> Quoting "Kennedy, Jessie" <J.Kennedy at napier.ac.uk>:
>
>> Hi Xianhua
>>
>>>
>>> I went through the mammal data Trevor created, and I thank him for the
>>> great job he did before moving. I just found a few things I am not
>>> sure about.
>>>
>>> Firstly, different pages in the same publication have been treated as
>>> different publications. See the following example:
>>>
>>> <Publication id="MSW_PUB5473"> <PublicationSimple>Ann. Mag. Nat. Hist.,
>>> ser. 8, 10:396.</PublicationSimple> </Publication>
>>> <Publication id="MSW_PUB5474"> <PublicationSimple>Ann. Mag. Nat. Hist.,
>>> ser. 8, 10:397.</PublicationSimple> </Publication>
>>> <Publication id="MSW_PUB5432"> <PublicationSimple>Ann. Mag. Nat. Hist.,
>>> ser. 8, 10:399.</PublicationSimple> </Publication>
>>>
>>>
>>> I wonder if it would be better to treat them as one publication,
>>> 'Ann. Mag. Nat. Hist., ser. 8, 10', with different page
>>> microreferences: 396, 397 and 399 respectively. This may be how TCS
>>> was designed to be used.
>>>
>> yes this was how it was designed and it would be better - this was 
>> only Trevor trying to automate the conversion - the only way to 
>> change this would be manually based on the data format we had - so 
>> if you can - making these changes would be more accurate.... thanks
>>
>>> Additionally, there are some duplicates among the publications. See
>>> the following examples:
>>>
>>> example 1:
>>>
>>> <Publication id="MSW_PUB5479"> <PublicationSimple>Ann. Sci. Nat. Zool.
>>> (Paris), ser. 5, 7:375.</PublicationSimple> </Publication>
>>> <Publication id="MSW_PUB5480"> <PublicationSimple>Ann. Sci. Nat. Zool.
>>> (Paris), ser. 5, 7:375.</PublicationSimple> </Publication>
>>>
>>> example 2:
>>>
>>> <Publication id="MSW_PUB5481"> <PublicationSimple>David, Nouv. Arch.
>>> Mus. Hist. Nat. Paris, Bull. for 1871, 7(4):92 [1872].</PublicationSimple>
>>> </Publication>
>>> <Publication id="MSW_PUB5482"> <PublicationSimple>David, Nouv. Arch.
>>> Mus. Hist. Nat. Paris, Bull. for 1871, 7(4):92 [1872].</PublicationSimple>
>>> </Publication>
>>>
>>
>> yes they shouldn't be there.......
>>
>>>
>>> Before we import the data into the SEEK database for further
>>> processing, we had better re-check the data and make it as clean as
>>> possible.
>>
>>
>> with Trevor having gone would you be able to do this? is this ok Bob?
>>
>> any problems we're happy to discuss...
>>
>> thanks,
>>
>> Jessie
>> _______________________________________________
>> seek-taxon mailing list
>> seek-taxon at ecoinformatics.org
>> http://www.ecoinformatics.org/mailman/listinfo/seek-taxon
>>
>


-------------- next part --------------
A non-text attachment was scrubbed...
Name: mammal_cleanup.zip
Type: application/x-zip-compressed
Size: 567704 bytes
Desc: not available
Url : http://mercury.nceas.ucsb.edu/ecoinformatics/pipermail/seek-taxon/attachments/20050301/3961b7ff/mammal_cleanup.bin

