Today I had another case of that … total breakdown of a live environment by infrastructure defect, bad enough. Customer rushing up the database and all apps services, just to see thousands of locks and hanging transactions. Big panic, frequent restarting of databases and applications. Getting worse, finally calling me.
Told them I’ll manage that, asked them to shut down all apps until I tell them to start up. Started the DB, read the newspaper for 10 minutes, then they started calling me. Told them, that I am working on it. After 20mins of reading the paper and answering calls, asked them to start the apps. => Voila, all was well.
The answer? Just let your SMON do its job.
Ok, this initially was not exactly about performance, but the initial mistake was old Windows-like thinking: Reboot it, a reboot will activate magic self-healing powers. Fail, crashing Oracle (what the service infrastructure did in this case) is a very bad thing to do, and even worse, it “registers a mortgage” on the system: As soon as you start up an instance for this database later, it has not only to crash-recover, but ALSO to roll back all open transactions after the DB was opened. The latter was the problem in the above case: SMON kicks in late. Now the swarm of busy application bees buzzing around after such a crash, has to share resources with SMON. And has to wait for still-open row locks, spinning on them, eating more resources… and now it is a performance issue – a pointless, home made performance catastrophe.
Keep calm
Martin