Over the past ten years, the open source movement has been one of the key driving forces of the IT industry and has become an important part of it. The role of open source is growing not only in quantitative terms; its qualitative position in the IT market is changing as well. Never sitting idly by, the brave PVS-Studio team actively helps strengthen the positions of open source projects by finding hidden bugs in their huge code bases and offering free licenses to such projects. This article is no exception: today we will talk about Apache Hive. I've got the report, and there is plenty worth looking at!
About PVS-Studio
The PVS-Studio static code analyzer has been on the IT market for more than 10 years and is a multifunctional, easy-to-adopt software solution. At the moment, the analyzer supports the C, C++, C#, and Java languages and runs on Windows, Linux, and macOS.
PVS-Studio is a paid B2B solution used by a large number of teams in various companies. If you want to see what the analyzer is capable of, download the distribution kit and request a trial key here.
If you are an open source enthusiast or, for example, a student, you can use one of the free PVS-Studio licensing options.
About Apache Hive
The volume of data has been growing rapidly in recent years. Standard databases can no longer keep up with this rate of growth, which gave rise to the term Big Data and everything connected with it (processing, storage, and all the other operations on data of such volume).
Currently, Apache Hadoop is considered one of the foundational Big Data technologies. Its main objectives are the storage, processing, and management of large volumes of data. The main components of the framework are Hadoop Common, HDFS, Hadoop MapReduce, and Hadoop YARN. Over time, a whole ecosystem of related projects and technologies has formed around Hadoop; many of them started out as part of the project and later became independent. One of these projects is Apache Hive.
Apache Hive is a distributed data warehouse. It manages data stored in HDFS and provides an SQL-based query language (HiveQL) for working with that data. For a detailed introduction to the project, you can study the information here.
About the analysis
The sequence of steps for the analysis is quite simple and does not take much time:
- got Apache Hive from GitHub;
- used the instructions for starting the Java analyzer and launched the analysis;
- received the analyzer report, reviewed it, and picked out the interesting cases.
Analysis results: 1456 warnings of High and Medium confidence levels (602 and 854, respectively) were issued for 6500+ files.
Not all warnings are errors. This is a normal situation; before using the analyzer regularly, it needs to be configured. After that, you can expect a fairly low rate of false positives (example).
Among the warnings, I did not consider the 407 warnings (177 High and 230 Medium) issued for test files. I also set aside the V6022 diagnostic rule (it is hard to separate erroneous situations from correct ones in unfamiliar code), which accounted for as many as 482 warnings, as well as V6021 with its 179 warnings.
Still, a fair number of warnings remained. And since I did not configure the analyzer, there are false positives among them. There is no point describing a large number of warnings in one article :). Let's look at what caught my eye and seemed interesting.
Predetermined conditions
The V6007 diagnostic rule is the record holder among all the remaining analyzer warnings: just over 200 of them! Some look harmless, some are suspicious, and others are real errors. Let's look at a few.
V6007 Expression 'key.startsWith("hplsql.")' is always true. Exec.java(675)
void initOptions() {
  ....
  if (key == null || value == null || !key.startsWith("hplsql.")) {
    ....
  }
  ....
}
A fairly long if-else-if construct! The analyzer complains about the last if (key.startsWith("hplsql.")), pointing out that it is always true when the program reaches this code fragment. Indeed, if you look at the very beginning of the if-else-if chain, the check has already been performed: if our string did not start with the substring "hplsql.", execution immediately jumped to the next iteration.
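A minimal sketch of the pattern the analyzer flags (the key names are hypothetical, not the actual Exec.java code): once the early guard rejects keys that do not start with "hplsql.", a later startsWith("hplsql.") check is guaranteed to be true.

```java
public class PrefixCheckDemo {
    static String classify(String key) {
        if (key == null || !key.startsWith("hplsql.")) {
            return "skipped";                      // all non-matching keys leave here
        } else if (key.equals("hplsql.conn.default")) {  // hypothetical option name
            return "default-connection";
        } else if (key.startsWith("hplsql.")) {    // always true at this point
            return "other-hplsql-option";
        }
        return "unreachable";                      // this line can never execute
    }
}
```

Such a check is usually harmless, but it suggests the developer may have intended a different, narrower condition in the last branch.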
V6007 Expression 'columnNameProperty.length() == 0' is always false. OrcRecordUpdater.java(238)
private static TypeDescription getTypeDescriptionFromTableProperties(....) {
  ....
  if (tableProperties != null) {
    final String columnNameProperty = ....;
    final String columnTypeProperty = ....;
    if (!Strings.isNullOrEmpty(columnNameProperty) &&
        !Strings.isNullOrEmpty(columnTypeProperty)) {
      List<String> columnNames = columnNameProperty.length() == 0
          ? new ArrayList<String>() : ....;
      List<TypeInfo> columnTypes = columnTypeProperty.length() == 0
          ? new ArrayList<TypeInfo>() : ....;
      ....
    }
  }
  ....
}
Comparing the length of the columnNameProperty string with zero will always return false, because the comparison sits under the check !Strings.isNullOrEmpty(columnNameProperty). If the program reaches the condition in question, the columnNameProperty string is guaranteed to be non-null and not empty. The same is true for the columnTypeProperty string; see the warning one line below:
- V6007 Expression 'columnTypeProperty.length() == 0' is always false. OrcRecordUpdater.java(239)
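The redundancy can be reproduced in a few lines. This is a self-contained sketch with a local isNullOrEmpty helper standing in for the Guava method; the column-splitting logic is invented for illustration:

```java
public class EmptyCheckDemo {
    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    static int columnCount(String columnNames) {
        if (!isNullOrEmpty(columnNames)) {
            // columnNames.length() == 0 is always false here: the guard above
            // already rejected null and empty strings, so the "empty list"
            // branch of the ternary is dead code.
            return columnNames.length() == 0 ? 0 : columnNames.split(",").length;
        }
        return 0;
    }
}
```

Dead branches like this are not bugs by themselves, but they obscure the real control flow and often survive refactorings that change the guard.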
V6007 Expression 'colOrScalar1.equals("Column")' is always false. GenVectorCode.java(3469)
private void generateDateTimeArithmeticIntervalYearMonth(String[] tdesc)
    throws Exception {
  ....
  String colOrScalar1 = tdesc[4];
  ....
  String colOrScalar2 = tdesc[6];
  ....
  if (colOrScalar1.equals("Col") && colOrScalar1.equals("Column")) {
    ....
  }
  ....
}
Here is trivial copy-paste. The colOrScalar1 string would have to equal two different values at the same time, which is impossible. Apparently, colOrScalar1 should be checked on the left and colOrScalar2 on the right.
More similar warnings on the lines below:
- V6007 Expression 'colOrScalar1.equals("Scalar")' is always false. GenVectorCode.java(3475)
- V6007 Expression 'colOrScalar1.equals("Column")' is always false. GenVectorCode.java(3486)
As a result, no actions in the if-else-if construct will ever be executed.
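The impossible condition and its likely intended form can be contrasted in a small sketch (the method names are illustrative, not from GenVectorCode.java):

```java
public class CopyPasteDemo {
    // Buggy: both checks target colOrScalar1, and no string can equal
    // "Col" and "Column" at the same time, so this is always false.
    static boolean bothColumnsBuggy(String colOrScalar1, String colOrScalar2) {
        return colOrScalar1.equals("Col") && colOrScalar1.equals("Column");
    }

    // Likely intent: compare each variable against its own expected value.
    static boolean bothColumnsFixed(String colOrScalar1, String colOrScalar2) {
        return colOrScalar1.equals("Col") && colOrScalar2.equals("Column");
    }
}
```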
Some other V6007 warnings:
- V6007 Expression 'characters == null' is always false. RandomTypeUtil.java(43)
- V6007 Expression 'writeIdHwm > 0' is always false. TxnHandler.java(1603)
- V6007 Expression 'fields.equals("*")' is always true. Server.java(983)
- V6007 Expression 'currentGroups != null' is always true. GenericUDFCurrentGroups.java(90)
- V6007 Expression 'this.wh == null' is always false. New returns not-null reference. StorageBasedAuthorizationProvider.java(93), StorageBasedAuthorizationProvider.java(92)
- and so on...
NPE
V6008 Potential null dereference of 'dagLock'. QueryTracker.java (557), QueryTracker.java (553)
private void handleFragmentCompleteExternalQuery(QueryInfo queryInfo) {
  if (queryInfo.isExternalQuery()) {
    ReadWriteLock dagLock = getDagLock(queryInfo.getQueryIdentifier());
    if (dagLock == null) {
      LOG.warn("Ignoring fragment completion for unknown query: {}",
               queryInfo.getQueryIdentifier());
    }
    boolean locked = dagLock.writeLock().tryLock();
    ....
  }
}
The code catches the null object, logs it, and... keeps going. As a result, the null object is dereferenced right after the check. Sad! Most likely, in the case of a null reference, the method should return immediately or throw some dedicated exception.
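A sketch of the fixed shape of this method, using an early return (the lock registry and method names here are invented to keep the example self-contained; they are not the actual QueryTracker API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockLookupDemo {
    private final Map<String, ReadWriteLock> locks = new HashMap<>();

    void register(String queryId) {
        locks.put(queryId, new ReentrantReadWriteLock());
    }

    // Hypothetical fixed version: leave the method when the lock is unknown
    // instead of only logging and then dereferencing null.
    boolean tryCompleteFragment(String queryId) {
        ReadWriteLock dagLock = locks.get(queryId);
        if (dagLock == null) {
            System.err.println("Ignoring fragment completion for unknown query: " + queryId);
            return false;                         // early exit prevents the NPE
        }
        boolean locked = dagLock.writeLock().tryLock();
        if (locked) {
            dagLock.writeLock().unlock();         // release what we acquired
        }
        return locked;
    }
}
```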
V6008 Null dereference of 'buffer' in function 'unlockSingleBuffer'. MetadataCache.java (410), MetadataCache.java (465)
private boolean lockBuffer(LlapBufferOrBuffers buffers, ....) {
  LlapAllocatorBuffer buffer = buffers.getSingleLlapBuffer();
  if (buffer != null) {
    ....
  }
  ....
}
And again a potential NPE. By the time the program reaches the unlockSingleBuffer method, the buffer object can be null. Suppose that happened! Looking at the unlockSingleBuffer method, we see our object dereferenced right on the first line. There you go!
Lost track of the shifts
V6034 Shift by the value of 'bitShiftsInWord - 1' could be inconsistent with the size of type: 'bitShiftsInWord - 1' = [-1 ... 30]. UnsignedInt128.java (1791)
private void shiftRightDestructive(int wordShifts,
                                   int bitShiftsInWord,
                                   boolean roundUp) {
  if (wordShifts == 0 && bitShiftsInWord == 0) {
    return;
  }
  assert (wordShifts >= 0);
  assert (bitShiftsInWord >= 0);
  assert (bitShiftsInWord < 32);
  if (wordShifts >= 4) {
    zeroClear();
    return;
  }
  final int shiftRestore = 32 - bitShiftsInWord;
  ....
}
A shift by -1 is possible. If, for example, wordShifts == 3 and bitShiftsInWord == 0 come into the method, then the indicated line ends up computing 1 << -1. Is that intended?
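What makes this dangerous is that Java does not reject a negative shift. The shift distance of an int is masked to its low 5 bits, so 1 << -1 silently becomes 1 << 31. A minimal illustration of the pattern (the helper below is a simplified stand-in, not the UnsignedInt128 code):

```java
public class NegativeShiftDemo {
    // Mirrors the flagged expression: when bitShiftsInWord == 0, the shift
    // distance is -1. Java masks it with & 0x1f, turning it into 31, so the
    // result is Integer.MIN_VALUE rather than an error.
    static int roundBit(int bitShiftsInWord) {
        return 1 << (bitShiftsInWord - 1);
    }
}
```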
V6034 Shift by the value of 'j' could be inconsistent with the size of type: 'j' = [0 ... 63]. IoTrace.java (272)
public void logSargResult(int stripeIx, boolean[] rgsToRead) {
  ....
  for (int i = 0, valOffset = 0; i < elements; ++i, valOffset += 64) {
    long val = 0;
    for (int j = 0; j < 64; ++j) {
      int ix = valOffset + j;
      if (rgsToRead.length == ix) break;
      if (!rgsToRead[ix]) continue;
      val = val | (1 << j);
      ....
    }
    ....
  }
  ....
}
On the indicated line, the variable j can take values in the range [0..63]. Because of this, the value of val may be computed in the loop not as the developer intended: in the expression (1 << j), the literal 1 is of type int, so shifting it by 32 or more goes beyond the permissible range. To remedy the situation, one should write ((long) 1 << j).
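The difference is easy to demonstrate: an int shift masks j modulo 32, so for j >= 32 the bit lands in the wrong position (or wraps back to bit 0), while a long shift uses the full 0..63 range:

```java
public class LongShiftDemo {
    // Buggy variant: the literal 1 is an int, so j is masked with & 0x1f.
    // For j == 32 this computes 1 << 0 == 1 instead of bit 32.
    static long buggyBit(int j) {
        return 1 << j;
    }

    // Fixed variant: 1L is a long, so the shift distance is masked with
    // & 0x3f and all 64 bit positions are reachable.
    static long fixedBit(int j) {
        return 1L << j;
    }
}
```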
Carried away by logging
V6046 Incorrect format. A different number of format items is expected. Arguments not used: 1, 2. StatsSources.java (89)
private static ImmutableList<PersistedRuntimeStats>
    extractStatsFromPlanMapper(....) {
  ....
  if (stat.size() > 1 || sig.size() > 1) {
    StringBuffer sb = new StringBuffer();
    sb.append(String.format(
        "expected(stat-sig) 1-1, got {}-{} ;",
        ....));
    ....
  }
  ....
}
When formatting the string via String.format(), the developer mixed up the syntax. As a result, the passed parameters never make it into the resulting string. I can only assume that the developer had been working on logging in a previous task and borrowed the '{}' placeholder syntax from there.
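The mix-up is silent: String.format simply ignores surplus arguments when the template contains no format specifiers, so nothing fails at run time. A small sketch of the bug and its fix:

```java
public class FormatDemo {
    // SLF4J-style "{}" placeholders mean nothing to String.format, so the
    // arguments are quietly dropped and the braces appear verbatim.
    static String buggy(int stat, int sig) {
        return String.format("expected(stat-sig) 1-1, got {}-{} ;", stat, sig);
    }

    // String.format needs printf-style specifiers such as %d.
    static String fixed(int stat, int sig) {
        return String.format("expected(stat-sig) 1-1, got %d-%d ;", stat, sig);
    }
}
```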
Stole the exception
V6051 The use of the 'return' statement in the 'finally' block can lead to the loss of unhandled exceptions. ObjectStore.java (9080)
private List<MPartitionColumnStatistics> getMPartitionColumnStatistics(....)
    throws NoSuchObjectException, MetaException {
  boolean committed = false;
  try {
    ....
    committed = commitTransaction();
    return result;
  } catch (Exception ex) {
    LOG.error("Error retrieving statistics via jdo", ex);
    if (ex instanceof MetaException) {
      throw (MetaException) ex;
    }
    throw new MetaException(ex.getMessage());
  } finally {
    if (!committed) {
      rollbackTransaction();
      return Lists.newArrayList();
    }
  }
}
Returning something from a finally block is a very bad practice, and this example shows why.
In the try block, a request is formed and the storage is queried. The committed variable defaults to false and changes its state only after all the actions in the try block complete successfully. This means that if an exception occurs, the variable will always be false. The catch block catches the exception, adjusts it slightly, and rethrows it. But when the turn of the finally block comes, execution enters the condition and returns an empty list to the caller. What does this return cost us? It costs us the fact that every caught exception will never propagate and be handled appropriately. The exceptions listed in the method signature will never be thrown and are simply misleading.
A similar warning:
- V6051 The use of the 'return' statement in the 'finally' block can lead to the loss of unhandled exceptions. ObjectStore.java (808)
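The exception-swallowing effect of a return in finally can be shown in a compact sketch. This is a simplified stand-in for the ObjectStore method (it uses unchecked exceptions and a boolean flag instead of a real transaction, purely for brevity):

```java
import java.util.ArrayList;
import java.util.List;

public class FinallyDemo {
    // Mirrors the flagged pattern: an exception thrown in the catch block is
    // discarded because the finally block completes abruptly with a return.
    static List<String> loadStats(boolean fail) {
        boolean committed = false;
        try {
            if (fail) {
                throw new IllegalStateException("jdo error");
            }
            committed = true;
            List<String> result = new ArrayList<>();
            result.add("row");
            return result;
        } catch (RuntimeException ex) {
            throw new RuntimeException("wrapped: " + ex.getMessage()); // never escapes
        } finally {
            if (!committed) {
                return new ArrayList<>(); // supersedes the pending exception
            }
        }
    }
}
```

Per the Java Language Specification, abrupt completion of a finally block (here, the return) discards any exception in flight, which is exactly why loadStats(true) returns an empty list instead of throwing.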
Other warnings
V6009 Function 'compareTo' receives an odd argument. An object 'o2.getWorkerIdentity()' is used as an argument to its own method. LlapFixedRegistryImpl.java(244)
@Override
public List<LlapServiceInstance> getAllInstancesOrdered(....) {
  ....
  Collections.sort(list, new Comparator<LlapServiceInstance>() {
    @Override
    public int compare(LlapServiceInstance o1, LlapServiceInstance o2) {
      return o2.getWorkerIdentity().compareTo(o2.getWorkerIdentity());
    }
  });
  ....
}
Copy-paste, carelessness, haste: there are many reasons to make this silly mistake. When checking open source projects, errors of this kind come up quite often; there is even a whole article about them.
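The consequence of comparing o2 with itself is that every pair looks equal, so a stable sort leaves the list in its original order. A self-contained sketch over strings (simplified from the LlapServiceInstance comparator):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ComparatorDemo {
    // Buggy comparator: compares o2 with itself, so it always returns 0
    // and the (stable) sort leaves the input order untouched.
    static final Comparator<String> BUGGY = (o1, o2) -> o2.compareTo(o2);

    // Intended comparator: compare the two different elements.
    static final Comparator<String> FIXED = (o1, o2) -> o1.compareTo(o2);

    static List<String> sortedWith(Comparator<String> cmp, String... items) {
        List<String> list = new ArrayList<>(Arrays.asList(items));
        list.sort(cmp);
        return list;
    }
}
```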
V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java (265)
public static long divideUnsignedLong(long dividend, long divisor) {
  if (divisor < 0L) {
    return (compareUnsignedLong(dividend, divisor)) < 0 ? 0L : 1L;
  }
  if (dividend >= 0) {
    ....
  }
  ....
}
Everything is quite simple here: none of the checks guards against division by 0.
More warnings:
- V6020 Mod by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java (309)
- V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java (276)
- V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java (312)
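One way to close the gap is an explicit guard before the division. This is a hypothetical sketch, not the SqlMathUtil fix; it delegates the actual unsigned division to the standard library:

```java
public class DivideDemo {
    // The original checks handle divisor < 0 and dividend >= 0, but a zero
    // divisor slips through both; an explicit guard makes the failure mode
    // deliberate instead of accidental.
    static long divideUnsignedChecked(long dividend, long divisor) {
        if (divisor == 0L) {
            throw new ArithmeticException("unsigned division by zero");
        }
        return Long.divideUnsigned(dividend, divisor);
    }
}
```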
V6030 The method located to the right of the '|' operator will be called regardless of the value of the left operand. Perhaps, it is better to use '||'. OperatorUtils.java (573)
public static Operator<? extends OperatorDesc> findSourceRS(....) {
  ....
  List<Operator<? extends OperatorDesc>> parents = ....;
  if (parents == null | parents.isEmpty()) {
    ....
  }
  ....
}
Instead of the logical operator ||, the bitwise operator | was written. This means the right-hand side is evaluated regardless of the result of the left-hand side. When parents == null, this typo immediately leads to an NPE in the next logical subexpression.
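The short-circuit difference is easy to demonstrate on a null argument. The sketch below mirrors the null-or-empty check with a String instead of a List, which keeps it self-contained:

```java
public class ShortCircuitDemo {
    // Buggy: '|' evaluates both operands, so s.isEmpty() runs even when
    // s == null and throws a NullPointerException.
    static boolean blankBuggy(String s) {
        return s == null | s.isEmpty();
    }

    // Fixed: '||' short-circuits; the right side is skipped when s == null.
    static boolean blankFixed(String s) {
        return s == null || s.isEmpty();
    }
}
```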
V6042 The expression is checked for compatibility with type 'A' but is cast to type 'B'. VectorColumnAssignFactory.java (347)
public static VectorColumnAssign buildObjectAssign(VectorizedRowBatch outputBatch,
    int outColIndex, PrimitiveCategory category) throws HiveException {
  VectorColumnAssign outVCA = null;
  ColumnVector destCol = outputBatch.cols[outColIndex];
  if (destCol == null) {
    ....
  } else if (destCol instanceof LongColumnVector) {
    switch (category) {
      ....
      case LONG:
        outVCA = new VectorLongColumnAssign() {
          ....
        }.init(...., (LongColumnVector) destCol);
        break;
      case TIMESTAMP:
        outVCA = new VectorTimestampColumnAssign() {
          ....
        }.init(...., (TimestampColumnVector) destCol);
        ....
    }
  }
  ....
}
The classes in question are LongColumnVector extends ColumnVector and TimestampColumnVector extends ColumnVector. The check that destCol is an instance of LongColumnVector tells us clearly that inside the conditional statement we have an object of that class. Nevertheless, it is cast to TimestampColumnVector! As you can see, these are different classes, apart from their common parent. The result is a ClassCastException. The same can be said about the cast to IntervalDayTimeColumnVector:
- V6042 The expression is checked for compatibility with type 'A' but is cast to type 'B'. VectorColumnAssignFactory.java (390)
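The class hierarchy and the failing sibling cast can be reproduced with three stub classes (the stubs stand in for the real Hive vector classes):

```java
public class CastDemo {
    static class ColumnVector {}
    static class LongColumnVector extends ColumnVector {}
    static class TimestampColumnVector extends ColumnVector {}

    // Mirrors the flagged pattern: the instanceof check admits a
    // LongColumnVector, but the body casts to its sibling type, which
    // fails at run time with ClassCastException.
    static boolean castsCleanly(ColumnVector destCol) {
        if (destCol instanceof LongColumnVector) {
            try {
                TimestampColumnVector cast = (TimestampColumnVector) destCol;
                return cast == destCol;   // never reached for a LongColumnVector
            } catch (ClassCastException e) {
                return false;             // sibling cast always lands here
            }
        }
        return true;                      // no cast attempted for other types
    }
}
```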
V6060 The 'var' reference was utilized before it was verified against null. Var.java (402), Var.java (395)
@Override
public boolean equals(Object obj) {
  if (getClass() != obj.getClass()) {
    ....
  }
  ....
}
A strange comparison of the var object with null after dereferencing has already occurred. In this context, var and obj are the same object (var = (Var) obj). The presence of the null check implies that a null object may be passed in. But if equals(null) is called, we get an NPE right on the first line instead of the expected false. Alas, the check is there, but not in the right place.
More suspicious cases where an object is used before the null check:
- V6060 The 'value' reference was utilized before it was verified against null. ParquetRecordReaderWrapper.java (168), ParquetRecordReaderWrapper.java (166)
- V6060 The 'defaultConstraintCols' reference was utilized before it was verified against null. HiveMetaStore.java (2539), HiveMetaStore.java (2530)
- V6060 The 'projIndxLst' reference was utilized before it was verified against null. RelOptHiveTable.java (683), RelOptHiveTable.java (682)
- V6060 The 'oldp' reference was utilized before it was verified against null. ObjectStore.java (4343), ObjectStore.java (4339)
- and so on...
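A sketch of how such an equals can be ordered correctly: reject null before touching the argument, then compare types and fields. The class and its field are invented for illustration; only the check ordering mirrors the Var.java situation:

```java
public class EqualsDemo {
    private final int type;

    EqualsDemo(int type) {
        this.type = type;
    }

    @Override
    public boolean equals(Object obj) {
        // Null check comes first, so equals(null) returns false
        // instead of throwing an NPE on obj.getClass().
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        EqualsDemo var = (EqualsDemo) obj;
        return this.type == var.type;
    }

    @Override
    public int hashCode() {
        return type;
    }
}
```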
Conclusion
Anyone who has taken even a little interest in Big Data can hardly have missed the significance of Apache Hive. The project is popular and quite large, containing more than 6500 source code files (*.java). The code has been written by many developers over many years, and, as a result, the static analyzer has plenty to find. This once again confirms that static analysis is extremely important and useful when developing medium and large projects!
Note. One-time checks like this demonstrate the capabilities of a static code analyzer but are a completely wrong way to use it. This idea is presented in more detail here and here. Use the analyzer regularly!
The check of Hive revealed a fair number of defects and suspicious fragments. If this article catches the attention of the Apache Hive development team, we will be glad to have contributed to this difficult undertaking.
You can't imagine Apache Hive without Apache Hadoop, so the unicorn from PVS-Studio may well pay a visit there too. But that's all for today. In the meantime, download the analyzer and check your own projects.
If you want to share this article with an English-speaking audience, please use the translation link: Maxim Stefanov. PVS-Studio Visits Apache Hive.