A common data warehousing technique I’ve seen people employ goes like this:
\ 1.\ They use a trigger to maintain a LAST_UPDATED column in the source table, much like the method described in the previous chapter in the “Optimistic Locking” section.
\ 2.\ To initially populate a data warehouse table, they remember what time it is right now by selecting out SYSDATE on the source system. For example, suppose it is exactly 9:00 a.m. right now.
\ 3.\ They then pull all of the rows from the transactional system—a full SELECT * FROM TABLE—to get the data warehouse initially populated.
\ 4.\ To refresh the data warehouse, they remember what time it is right now again. For example, suppose an hour has gone by—it is now 10:00 a.m. on the source system. They will remember that fact. They then pull all changed records since 9:00 a.m. (the moment before they started the first pull) and merge them in.
Note This technique may pull the same record twice in two consecutive refreshes. This is unavoidable due to the granularity of the clock. A MERGE operation will not be affected by this (i.e., update existing record in the data warehouse or insert a new record).
They believe that they now have all of the records in the data warehouse that were modified since they did the initial pull. They may actually have all of the records, but just as likely they may not. This technique does work on some other databases—ones thatemploy a locking system whereby reads are blocked by writes and vice versa. But in a system where you have nonblocking reads, the logic is flawed.
To see the flaw in this example, all we need to do is assume that at 9:00 a.m. there was at least one open, uncommitted transaction. At 8:59:30 a.m., it had updated a row in the table we were to copy. At 9:00 a.m., when we started pulling the data and thus reading the data in this table, we would not see the modifications to that row; we would see the last committed version of it. If it was locked when we got to it in our query, we would read around the lock. If it was committed by the time we got to it, we would still read around it since read consistency permits us to read only data that was committed in the database when our statement began. We would not read that new version of the row during the 9:00 a.m. initial pull, nor would we read the modified row during the 10:00 a.m. refresh. The reason? The 10:00 a.m. refresh would only pull records modified since 9:00 a.m. that morning, but this record was modified at 8:59:30 a.m. We would never pull this changed record.
In many other databases where reads are blocked by writes and a committed but inconsistent read is implemented, this refresh process would work perfectly. If at 9:00 a.m. when we did the initial pull of data, we hit that row and it was locked, we would have blocked and waited for it and read the committed version. If it were not locked, we would just read whatever was there, committed.
So, does this mean the preceding logic just cannot be used? No, it means that we need to get the “right now” time a little differently. We need to query V$TRANSACTION and find out which is the earliest of the current time and the time recorded in START_TIME column of this view. We will need to pull all records changed since the start time of the oldest transaction (or the current SYSDATE value if there are no active transactions):
SQL> select nvl( min(to_date(start_time,’mm/dd/rr hh24:mi:ss’)),sysdate) from v$transaction;
Note The preceding query works regardless of the presence of any data in V$TRANSACTION. That is, even if V$TRANSACTION is empty (because there are no transactions currently), this query returns a record. A query that has an aggregate with no WHERE clause always returns at least one row and at most one row.
In this example, that would be 8:59:30 a.m. when the transaction that modified the row started. When we go to refresh the data at 10:00 a.m., we pull all of the changes that had occurred since that time; when we merge these into the data warehouse, we’ll have everything we need.