
TPUMP How It Works

TPump is similar to MultiLoad: it edits Teradata tables by processing inserts, updates, and deletes.

TPump performs updates on the Teradata Database in a synchronous manner. Changes are sent in conventional CLIv2 parcels and applied immediately to the target table(s). To improve its efficiency, TPump builds multiple-statement requests and provides the serialize option to help reduce locking overhead.
In contrast to MultiLoad, TPump does better on relatively low volumes of changes because there is no temporary-table overhead. TPump becomes expensive for large volumes of data because multiple updates to a physical data block will most likely result in multiple reads and writes of the block.

The most important technique used by TPump to improve performance over MultiLoad is the multiple statement request. Placing more statements in a single request is beneficial for two reasons. First, it reduces network overhead because large messages are more efficient than small ones. Secondly, (in ROBUST mode) it reduces TPump recovery overhead, which amounts to one extra database row written for each request. TPump automatically packs multiple statements into a request based upon the PACK specification in the BEGIN LOAD command.
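As a minimal sketch of how the pack factor is scripted (session count, table, and error-table names here are illustrative, not from this post), the PACK option of the BEGIN LOAD command sets how many statements TPump packs into each request:

```
.BEGIN LOAD
     SESSIONS 4
     PACK 20                      /* pack up to 20 statements per request */
     ERRORTABLE mydb.tpump_err;   /* illustrative error-table name       */
```

A larger PACK value means fewer, larger messages on the network and (in ROBUST mode) fewer request-log rows per statement.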

Macros: TPump uses macros to efficiently modify tables, rather than using the actual DML commands. The technique of changing statements into equivalent macros before beginning the job greatly improves performance. Specifically, the benefits of using macros are:
1. The size of the network (and channel) messages sent to the RDBMS by TPump is reduced.
2. RDBMS parsing-engine overhead is reduced because the execution plans (or “steps”) for macros are cached and re-used. This eliminates “normal” parser handling, where each request sent by TPump is planned and optimized. Because the space required by macros is negligible, the only issue is where the macros are placed in the RDBMS: they are put into the database that contains the restart log table, or into the database specified using the MACRODB keyword in the BEGIN LOAD command.
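As a sketch (the database name macro_db is illustrative), macro placement is controlled with the MACRODB option on BEGIN LOAD:

```
.BEGIN LOAD
     SESSIONS 4
     MACRODB macro_db;   /* put TPump-generated macros in macro_db      */
                         /* default: the database of the restart log    */
```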

Locking and Transactional Logic
In contrast to MultiLoad, TPump uses conventional row-hash locking, which allows some amount of concurrent read and write access to its target tables. At any point TPump can be stopped and the target tables are fully accessible. Note, however, that if TPump is stopped, then depending on the nature of the update process, the “relational” integrity of the data may be impaired.
This differs from MultiLoad, which operates as a single logical update to one or more target tables. Once MultiLoad enters phase two of its logic, the job is “essentially” irreversible and the entire set of target tables is locked for write access until it completes.

Recovery Logic and Overhead
TPump, in “ROBUST mode”, writes one database row in the restart log table for every request that it issues. This collection of rows can be referred to as the request log. Because a request is guaranteed by the RDBMS to either completely finish or completely roll back, the request log will always accurately reflect the completion status of a TPump import. Thus, the request-log overhead for restart logic decreases as the number of statements packed per request increases.
TPump also allows you to specify a checkpoint interval. During the checkpoint process, TPump flushes all pending changes from the import file to the database and also cleans out the request log. The larger the checkpoint interval, the larger the request log (and its table) is going to grow. Upon an unexpected restart, TPump scans the import data source along with the request log in order to re-execute only the statements not found in the request log.
TPump in “SIMPLE (non-ROBUST) mode” provides only basic checkpoints. If a restart occurs between checkpoints, some requests will likely be reprocessed. This is adequate protection under some circumstances — for example, when re-executing a statement is harmless (such as an UPDATE that sets a column to an absolute value) or is rejected as a duplicate-row error.
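The two recovery modes and the checkpoint interval are both scripted on BEGIN LOAD; a minimal sketch (the 15-minute interval is an illustrative value):

```
.BEGIN LOAD
     SESSIONS 4
     ROBUST ON        /* one request-log row per request; exact restart   */
     CHECKPOINT 15;   /* checkpoint, and prune the request log, every 15  */
                      /* minutes; ROBUST OFF gives SIMPLE-mode restarts   */
```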

Serialization of Changes
In certain uses of TPump or MultiLoad it is possible to have multiple changes to one row in the same job. For instance, the row may be inserted and then updated during the batch job, or it may be updated and then deleted. In any case, the correct ordering of these operations is obviously very important. MultiLoad automatically guarantees that this ordering of operations is maintained correctly. By using the serialization feature, TPump can also guarantee that this ordering is maintained correctly, but it requires a small amount of scripting work and a small amount of utility overhead.
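The scripting work is tagging the row-identifying field(s) with KEY in the LAYOUT so that all changes to a given row flow through the same session in order. A hedged sketch, with illustrative table, layout, and file names:

```
.BEGIN LOAD
     SESSIONS 4
     PACK 20
     SERIALIZE ON;                /* route changes to the same row through */
                                  /* one session, preserving their order   */
.LAYOUT custlayout;
.FIELD cust_id   * INTEGER KEY;   /* KEY marks the field(s) used to hash   */
.FIELD cust_name * VARCHAR(30);   /* each record to a session              */
.DML LABEL upd;
UPDATE customers SET cust_name = :cust_name WHERE cust_id = :cust_id;
.IMPORT INFILE custdata
     LAYOUT custlayout
     APPLY upd;
.END LOAD;
```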

Resource Usage and Limitations
A feature unique to TPump is the ability to constrain run-time resource usage through the statement rate feature. TPump gives you control over the rate per minute at which statements are sent to the RDBMS, and the statement rate correlates directly to resource usage on both the client and in the RDBMS. The statement rate can be controlled in two ways: dynamically while the job is running, or statically by scripting it into the job with the RATE keyword on the BEGIN LOAD command. Dynamic control over the statement rate is provided by updates to a table on the RDBMS.
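The scripted form is a single option on BEGIN LOAD; a sketch with an illustrative rate:

```
.BEGIN LOAD
     SESSIONS 4
     PACK 20
     RATE 1200;   /* send at most 1200 statements per minute; the rate */
                  /* can also be adjusted while the job runs           */
```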

Operating Modes
  • Interactive – interactive processing involves the more or less continuous participation of the user.
  • Batch – batch programs process data in groups as per scheduled operations, typically in a separate operation rather than interactively or in real time.



This post first appeared on Teradata SQL Reference.
