[kepler-dev] [Bug 4764] ProvenanceRecorder.changeExecuted slow after workflow run

bugzilla-daemon at ecoinformatics.org bugzilla-daemon at ecoinformatics.org
Fri Feb 12 15:41:06 PST 2010


http://bugzilla.ecoinformatics.org/show_bug.cgi?id=4764

--- Comment #9 from Oliver Soong <soong at nceas.ucsb.edu> 2010-02-12 15:41:06 PST ---
That's correct and expected.  recordContainerContents is initiated from
changeExecuted, and it's recordContainerContents that is called thousands of
times per call to changeExecuted.

I'm not convinced there's anything obviously wrong with the way this works, but
perhaps we can think of ways to make it less of a hog.  First, it's
apparently all right for provenance recording to be delayed until the first run,
so there doesn't seem to be an obvious reason why we have to make a full record
for each of the 3 ChangeRequests that changing targetYear triggers.  There are
probably not-so-obvious reasons that I'm not aware of.
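To make the deferral idea concrete, here is a rough sketch of what queueing
pre-run ChangeRequests might look like.  None of these class or method names
are Kepler's (the real listener is ProvenanceRecorder.changeExecuted); this is
just a hypothetical illustration, in Python for brevity, of appending change
events cheaply and doing the expensive full record once when execution starts.

```python
class DeferredRecorder:
    """Hypothetical recorder: queue changes until the first run, then flush."""

    def __init__(self):
        self._pending = []    # cheap to append; no recording work yet
        self._recorded = []   # stands in for rows written to the provenance DB
        self._ran = False

    def change_executed(self, change):
        if self._ran:
            self._record_full(change)      # after the first run, record eagerly
        else:
            self._pending.append(change)   # before any run, just queue it

    def execution_started(self):
        # Flush once, so the 3 pre-run ChangeRequests cost a single pass
        # instead of 3 full recordContainerContents-style walks.
        for change in self._pending:
            self._record_full(change)
        self._pending.clear()
        self._ran = True

    def _record_full(self, change):
        # Stand-in for the expensive full-workflow record.
        self._recorded.append(change)


recorder = DeferredRecorder()
for c in ("change-1", "change-2", "change-3"):
    recorder.change_executed(c)       # nothing recorded yet
print(len(recorder._recorded))        # → 0
recorder.execution_started()          # one flush records all three
print(len(recorder._recorded))        # → 3
```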

Second, I think a single provenance recording checks and updates both the
RegEntity cache and the provenance HSQLDB once per NamedObj (director, actor,
port, parameter, relation, annotation, etc.) in the workflow, which in the case
of tpc09 is apparently over 3000 objects.  Given the profiling results, I don't
think the RegEntity cache/hash map is a problem, but I suspect we could make
updating the provenance HSQLDB more efficient.  My just-enough-to-be-dangerous
understanding of databases is that they're usually optimized to operate in
bulk, so it might be faster (though not necessarily easier) to build a list of
entries in the workflow, make a single duplicate-check query, and then make a
single batched update.  There are probably reasons why this is a horrible idea,
but I haven't thought of them yet.
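For what the bulk approach might look like: the sketch below uses Python's
sqlite3 purely as a stand-in for HSQLDB-over-JDBC, with an assumed entity table
and column names (none of this is Kepler's actual schema).  It replaces
per-NamedObj round trips with one duplicate-check query and one batched insert.

```python
import sqlite3

def record_entities_bulk(conn, entities):
    """One duplicate-check query plus one batched insert, instead of one
    check-and-insert round trip per NamedObj.  (For thousands of entities the
    IN list would need chunking in practice; omitted here for clarity.)"""
    cur = conn.cursor()
    names = [e["name"] for e in entities]
    placeholders = ",".join("?" * len(names))
    cur.execute(
        f"SELECT name FROM entity WHERE name IN ({placeholders})", names)
    existing = {row[0] for row in cur.fetchall()}
    new_rows = [(e["name"], e["type"])
                for e in entities if e["name"] not in existing]
    cur.executemany("INSERT INTO entity (name, type) VALUES (?, ?)", new_rows)
    conn.commit()
    return len(new_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (name TEXT PRIMARY KEY, type TEXT)")
conn.execute("INSERT INTO entity VALUES ('dir1', 'director')")  # already known

workflow = [{"name": "dir1",   "type": "director"},   # duplicate, skipped
            {"name": "actor1", "type": "actor"},
            {"name": "port1",  "type": "port"}]
print(record_entities_bulk(conn, workflow))  # → 2 (only the new entities)
```

The win, if there is one, would come from amortizing per-statement overhead
(parsing, index lookups, commits) across the whole workflow instead of paying
it 3000+ times per changeExecuted.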

-- 
Configure bugmail: http://bugzilla.ecoinformatics.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.

