durusmail: durus-users: Interesting BTree failure
Interesting BTree failure
David Hess
2011-08-15
On Aug 15, 2011, at 9:55 AM, David Binger wrote:

> On Aug 15, 2011, at 10:12 AM, David Hess wrote:
>
>>
>> On Aug 15, 2011, at 8:50 AM, David Binger wrote:
>>
>>> This sounds like the offset mapping in memory is somehow out of sync with
>>> the file.  We've never seen this condition, which is obviously bad.  I'm
>>> concerned about this and trying to figure out how it could occur.
>>
>> Yes, if the packing implementation was an issue, I would have thought we
>> would just have missing oids (or is that a durus_id now?) - but I also have
>> evidence of an object whose class is BNode trying to load the state of a
>> pickle that was clearly associated with an application object.
>
> Yes, the oid is now called "durus_id".
>
> I just meant that the BNode class doesn't have any special state management
> code that could be causing the problem.  I think BNode is involved only
> because that is where your actions happen to be.

Agree.

>> Based on what I did to repair the database, it appears this corruption only
>> affected the items that were touched when the modification to the BTree was
>> made during the pack (interior BNodes and the application objects).  It's
>> almost as if there is a race problem during packing, having to do with
>> either durus_id allocation or file writing.
>
> I would be interested in the details of the repair.

This particular BTree is an index of all instances of a particular Persistent
subclass. My repair involved a monkey-patched get_crawler that could handle an
exception in __setstate__ and keep going.

I then walked through the entire database, finding all instances of my
particular class that didn't blow up when loaded, and loaded them into a
brand-new BTree. I then deleted the old BTree and packed the resulting
database. Based on the debugging I added to the monkey-patched get_crawler,
the magnitude of the problem reported there matched the amount of change that
went on during the packing. I haven't yet confirmed that it was exactly the
changes that occurred during the packing.

> I don't remember the details of how you are using FileStorage.
> Do you have just one thread/process using the FileStorage?  I hope so.

We are definitely one thread and one process for this FileStorage. We do use
Twisted, so there are a lot of asynchronous things going on, interleaved with
each crank forward of the packing generator. We commit every 5 seconds rather
than on a transactional basis. We never have conflicts.
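The "crank forward" pattern mentioned above can be sketched like this: the pack is a generator that yields after each unit of work, and the event loop advances it one step at a time between other callbacks. Everything here is illustrative; Durus's real packer does not have this shape or these names.

```python
def incremental_pack(records, sink):
    # Copy live records to the new file one at a time, yielding control
    # back to the event loop after each record.
    for record in records:
        if record["live"]:
            sink.append(record)
        yield

packed = []
pack = incremental_pack(
    [{"id": 1, "live": True}, {"id": 2, "live": False}, {"id": 3, "live": True}],
    packed)

for _ in pack:      # the event loop "cranks" the generator one step per pass
    pass            # ...other asynchronous callbacks would run here
```

In the real system the cranking is driven by the Twisted reactor, which is why so much unrelated async work is interleaved with the pack.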

>>> It seems unlikely to have anything in particular to do with BTree or
>>> BNodes: those structures are implemented in the same way as other
>>> persistent objects.  I think, from your other message, that you are
>>> looking closely at the packing code, and that seems like a good idea.
>>> There was a time when you talked about overriding a '_p_' method.
>>> Does the code you are running involve any such low-level modifications?
>>
>> That's on one particular and isolated class that was not involved in this
>> situation.  All it does is ignore the ghosting request from the cache
>> manager (which is safe because we never have conflicts).
>
> That does seem safe.   Is there any code in the cache manager, though, that
> calls the ghosting function and then takes some action based on the assumption
> that the ghosting worked?

The Cache.shrink implementation didn't seem to care - but I'm not familiar
enough with exactly how that code works to be sure.
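The kind of override under discussion looks roughly like the sketch below: a Persistent subclass whose ghosting hook is a no-op, so the cache manager's request silently does nothing. The method name _p_set_status_ghost and the minimal base class are assumptions for illustration, not Durus's verified API.

```python
class Persistent:
    # Minimal stand-in for durus.persistent.Persistent, just enough to
    # show the override; real ghosting drops the object's loaded state.
    def __init__(self):
        self.ghosted = False

    def _p_set_status_ghost(self):
        self.ghosted = True

class PinnedInCache(Persistent):
    def _p_set_status_ghost(self):
        pass    # ignore the cache manager; safe only because conflicts never occur

obj = PinnedInCache()
obj._p_set_status_ghost()   # request is silently ignored
```

Binger's question is exactly about the failure mode this creates: if Cache.shrink assumed the ghosting succeeded, the accounting of cache size could drift.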

>>> In the current code tree, File.write() ends with self.file.flush() under
>>> all conditions.  Is that the case in the code you are running?
>>
>> Here's the implementation we are running from on Durus 3.7:
>>
>>     def write(self, s):
>>         self.obtain_lock()
>>         self.file.write(s)
>
> The new code adds self.file.flush().  Without it, certain file systems
> lose track of the correct end of the file, so a subsequent seek to the end
> of the file ends up in an incorrect place.
> In a pack, an offset is written into the header of the file and then
> there is a seek to the end of the file.  It is important for
> this to work correctly.  It could possibly be the reason
> for the trouble you have seen.

Our platform is Ubuntu 10.04 LTS and this is an ext3 file system.
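The fix Binger describes amounts to ending write() with a flush so the file system's notion of end-of-file stays accurate before a later seek-to-end. A minimal sketch, with locking elided and an in-memory file standing in for the real one (names follow the 3.7 snippet quoted above, but this is not the actual Durus source):

```python
import io

class File:
    # Minimal stand-in for the Durus File wrapper; obtain_lock() elided.
    def __init__(self, fileobj):
        self.file = fileobj

    def write(self, s):
        self.file.write(s)
        self.file.flush()       # keep the file system's EOF accurate

    def seek_end(self):
        self.file.seek(0, 2)    # 2 == os.SEEK_END
        return self.file.tell()

f = File(io.BytesIO())
f.write(b"hello")
```

Without the flush, a pack that writes an offset into the header and then seeks to the end of the file can land in the wrong place on some file systems, which matches the corruption pattern described.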

>> I noticed that DurusWorks 1.0 has been released but the CHANGES.txt file is
>> gone.  Should we upgrade to DurusWorks 1.0 instead of 3.8?  It looks like
>> Durus is no longer distributed separately?
>
> Yes, DurusWorks is a new distribution that includes Durus.
> I have not yet sent out an announcement.  Sorry.
> I'll start a new CHANGES.txt file in the next release.
> I'd recommend downloading it and diffing the durus package
> against the one you have.  This code is not changed
> much, so I hope that is not too much trouble.
> I would upgrade.

OK, we'll head in that direction.

Dave

------
David K. Hess
877.343.4947 x114
dhess@fishtechnology.com
