Over the past ten years, the open source movement has been one of the key driving forces of the IT industry and has become an important part of it. The role of open source is growing not only in quantitative terms; its qualitative position in the IT market is changing as well. Never sitting idly by, the brave PVS-Studio team actively helps strengthen the positions of open source projects by finding hidden bugs in their huge code bases and offering free licenses to such projects. This article is no exception: today we will talk about Apache Hive. I've got the report, and there is plenty worth looking at!
About PVS-Studio
The PVS-Studio static code analyzer has been on the IT market for more than 10 years and is a multifunctional, easy-to-adopt software solution. At the moment, the analyzer supports the C, C++, C#, and Java languages and runs on Windows, Linux, and macOS.
PVS-Studio is a paid B2B solution used by a large number of teams in various companies. If you want to see what the analyzer is capable of, download the distribution kit and request a trial key here.
If you are an open source enthusiast or, for example, a student, you can use one of the free PVS-Studio licensing options.
About Apache Hive
The volume of data has been growing rapidly in recent years. Standard databases can no longer keep up with this rate of growth, which gave rise to the term Big Data and everything connected with it (processing, storage, and all the other operations on data of such volume).
Currently, Apache Hadoop is considered one of the foundational Big Data technologies. Its main objectives are the storage, processing, and management of large volumes of data. The main components of the framework are Hadoop Common, HDFS, Hadoop MapReduce, and Hadoop YARN. Over time, a whole ecosystem of related projects and technologies has formed around Hadoop; many of them started out as part of the project and later became independent. One of these projects is Apache Hive.
Apache Hive is a distributed data warehouse. It manages data stored in HDFS and provides an SQL-based query language (HiveQL) for working with that data. For a detailed introduction to the project, you can study the information here.
About the analysis
The sequence of steps for the analysis is quite simple and does not take much time:
- got Apache Hive from GitHub;
- used the instructions for starting the Java analyzer and launched the analysis;
- received the analyzer report, reviewed it, and picked out the interesting cases.
Analysis results: 1456 warnings of High and Medium confidence levels (602 and 854, respectively) were issued for 6500+ files.
Not all warnings are errors. This is a normal situation; before using the analyzer regularly, it needs to be configured. After that, you can expect a fairly low rate of false positives (example).
Among the warnings, I did not consider the 407 warnings (177 High and 230 Medium) issued for test files. I also set aside the V6022 diagnostic rule (it is hard to separate erroneous situations from correct ones in unfamiliar code), which accounted for as many as 482 warnings, as well as V6021 with its 179 warnings.
Still, a fair number of warnings remained. And since I did not configure the analyzer, there are false positives among them. There is no point describing a large number of warnings in one article :). Let's look at what caught my eye and seemed interesting.
Predetermined conditions
The V6007 diagnostic rule is the record holder among all the remaining analyzer warnings: just over 200 of them! Some look harmless, some are suspicious, and others are real errors. Let's look at a few.
V6007 Expression 'key.startsWith("hplsql.")' is always true. Exec.java(675)
void initOptions() {
  ....
  if (key == null || value == null || !key.startsWith("hplsql.")) {
    ....
  }
  ....
}
A fairly long if-else-if construct! The analyzer complains about the last if (key.startsWith("hplsql.")), pointing out that it is always true when the program reaches this code fragment. Indeed, if you look at the very beginning of the if-else-if chain, the check has already been performed: if our string did not start with the substring "hplsql.", execution immediately jumped to the next iteration.
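A minimal sketch of the pattern the analyzer flags (the key names are hypothetical, not the actual Exec.java code): once the early guard rejects keys that do not start with "hplsql.", a later startsWith("hplsql.") check is guaranteed to be true.

```java
public class PrefixCheckDemo {
    static String classify(String key) {
        if (key == null || !key.startsWith("hplsql.")) {
            return "skipped";                      // all non-matching keys leave here
        } else if (key.equals("hplsql.conn.default")) {  // hypothetical option name
            return "default-connection";
        } else if (key.startsWith("hplsql.")) {    // always true at this point
            return "other-hplsql-option";
        }
        return "unreachable";                      // this line can never execute
    }
}
```

Such a check is usually harmless, but it suggests the developer may have intended a different, narrower condition in the last branch.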
V6007 Expression 'columnNameProperty.length() == 0' is always false. OrcRecordUpdater.java(238)
private static TypeDescription getTypeDescriptionFromTableProperties(....) {
  ....
  if (tableProperties != null) {
    final String columnNameProperty = ....;
    final String columnTypeProperty = ....;
    if (!Strings.isNullOrEmpty(columnNameProperty) &&
        !Strings.isNullOrEmpty(columnTypeProperty)) {
      List<String> columnNames = columnNameProperty.length() == 0
          ? new ArrayList<String>() : ....;
      List<TypeInfo> columnTypes = columnTypeProperty.length() == 0
          ? new ArrayList<TypeInfo>() : ....;
      ....
    }
  }
  ....
}
Comparing the length of the columnNameProperty string with zero will always return false, because the comparison sits under the check !Strings.isNullOrEmpty(columnNameProperty). If the program reaches the condition in question, the columnNameProperty string is guaranteed to be non-null and not empty. The same is true for the columnTypeProperty string; see the warning one line below:
- V6007 Expression 'columnTypeProperty.length() == 0' is always false. OrcRecordUpdater.java(239)
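The redundancy can be reproduced in a few lines. This is a self-contained sketch with a local isNullOrEmpty helper standing in for the Guava method; the column-splitting logic is invented for illustration:

```java
public class EmptyCheckDemo {
    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    static int columnCount(String columnNames) {
        if (!isNullOrEmpty(columnNames)) {
            // columnNames.length() == 0 is always false here: the guard above
            // already rejected null and empty strings, so the "empty list"
            // branch of the ternary is dead code.
            return columnNames.length() == 0 ? 0 : columnNames.split(",").length;
        }
        return 0;
    }
}
```

Dead branches like this are not bugs by themselves, but they obscure the real control flow and often survive refactorings that change the guard.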
V6007 Expression 'colOrScalar1.equals("Column")' is always false. GenVectorCode.java(3469)
private void generateDateTimeArithmeticIntervalYearMonth(String[] tdesc)
    throws Exception {
  ....
  String colOrScalar1 = tdesc[4];
  ....
  String colOrScalar2 = tdesc[6];
  ....
  if (colOrScalar1.equals("Col") && colOrScalar1.equals("Column")) {
    ....
  }
  ....
}
Here is trivial copy-paste. The colOrScalar1 string would have to equal two different values at the same time, which is impossible. Apparently, colOrScalar1 should be checked on the left and colOrScalar2 on the right.
More similar warnings on the lines below:
- V6007 Expression 'colOrScalar1.equals("Scalar")' is always false. GenVectorCode.java(3475)
- V6007 Expression 'colOrScalar1.equals("Column")' is always false. GenVectorCode.java(3486)
As a result, no actions in the if-else-if construct will ever be executed.
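The impossible condition and its likely intended form can be contrasted in a small sketch (the method names are illustrative, not from GenVectorCode.java):

```java
public class CopyPasteDemo {
    // Buggy: both checks target colOrScalar1, and no string can equal
    // "Col" and "Column" at the same time, so this is always false.
    static boolean bothColumnsBuggy(String colOrScalar1, String colOrScalar2) {
        return colOrScalar1.equals("Col") && colOrScalar1.equals("Column");
    }

    // Likely intent: compare each variable against its own expected value.
    static boolean bothColumnsFixed(String colOrScalar1, String colOrScalar2) {
        return colOrScalar1.equals("Col") && colOrScalar2.equals("Column");
    }
}
```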
Some other V6007 warnings:
- V6007 Expression 'characters == null' is always false. RandomTypeUtil.java(43)
- V6007 Expression 'writeIdHwm > 0' is always false. TxnHandler.java(1603)
- V6007 Expression 'fields.equals("*")' is always true. Server.java(983)
- V6007 Expression 'currentGroups != null' is always true. GenericUDFCurrentGroups.java(90)
- V6007 Expression 'this.wh == null' is always false. New returns not-null reference. StorageBasedAuthorizationProvider.java(93), StorageBasedAuthorizationProvider.java(92)
- and so on...
NPE
V6008 Potential null dereference of 'dagLock'. QueryTracker.java (557), QueryTracker.java (553)
private void handleFragmentCompleteExternalQuery(QueryInfo queryInfo) {
  if (queryInfo.isExternalQuery()) {
    ReadWriteLock dagLock = getDagLock(queryInfo.getQueryIdentifier());
    if (dagLock == null) {
      LOG.warn("Ignoring fragment completion for unknown query: {}",
               queryInfo.getQueryIdentifier());
    }
    boolean locked = dagLock.writeLock().tryLock();
    ....
  }
}
The code catches the null object, logs it, and... keeps going. As a result, the null object is dereferenced right after the check. Sad! Most likely, in the case of a null reference, the method should return immediately or throw some dedicated exception.
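A sketch of the fixed shape of this method, using an early return (the lock registry and method names here are invented to keep the example self-contained; they are not the actual QueryTracker API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockLookupDemo {
    private final Map<String, ReadWriteLock> locks = new HashMap<>();

    void register(String queryId) {
        locks.put(queryId, new ReentrantReadWriteLock());
    }

    // Hypothetical fixed version: leave the method when the lock is unknown
    // instead of only logging and then dereferencing null.
    boolean tryCompleteFragment(String queryId) {
        ReadWriteLock dagLock = locks.get(queryId);
        if (dagLock == null) {
            System.err.println("Ignoring fragment completion for unknown query: " + queryId);
            return false;                         // early exit prevents the NPE
        }
        boolean locked = dagLock.writeLock().tryLock();
        if (locked) {
            dagLock.writeLock().unlock();         // release what we acquired
        }
        return locked;
    }
}
```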
V6008 Null dereference of 'buffer' in function 'unlockSingleBuffer'. MetadataCache.java (410), MetadataCache.java (465)
private boolean lockBuffer(LlapBufferOrBuffers buffers, ....) {
  LlapAllocatorBuffer buffer = buffers.getSingleLlapBuffer();
  if (buffer != null) {
    ....
  }
  ....
}
And again a potential NPE. By the time the program reaches the unlockSingleBuffer method, the buffer object can be null. Suppose that happened! Looking at the unlockSingleBuffer method, we see our object dereferenced right on the first line. There you go!
Lost track of the shifts
V6034 Shift by the value of 'bitShiftsInWord - 1' could be inconsistent with the size of type: 'bitShiftsInWord - 1' = [-1 ... 30]. UnsignedInt128.java (1791)
private void shiftRightDestructive(int wordShifts,
                                   int bitShiftsInWord,
                                   boolean roundUp) {
  if (wordShifts == 0 && bitShiftsInWord == 0) {
    return;
  }
  assert (wordShifts >= 0);
  assert (bitShiftsInWord >= 0);
  assert (bitShiftsInWord < 32);
  if (wordShifts >= 4) {
    zeroClear();
    return;
  }
  final int shiftRestore = 32 - bitShiftsInWord;
  ....
}
A shift by -1 is possible. If, for example, wordShifts == 3 and bitShiftsInWord == 0 come into the method, then the indicated line ends up computing 1 << -1. Is that intended?
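What makes this dangerous is that Java does not reject a negative shift. The shift distance of an int is masked to its low 5 bits, so 1 << -1 silently becomes 1 << 31. A minimal illustration of the pattern (the helper below is a simplified stand-in, not the UnsignedInt128 code):

```java
public class NegativeShiftDemo {
    // Mirrors the flagged expression: when bitShiftsInWord == 0, the shift
    // distance is -1. Java masks it with & 0x1f, turning it into 31, so the
    // result is Integer.MIN_VALUE rather than an error.
    static int roundBit(int bitShiftsInWord) {
        return 1 << (bitShiftsInWord - 1);
    }
}
```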
V6034 Shift by the value of 'j' could be inconsistent with the size of type: 'j' = [0 ... 63]. IoTrace.java (272)
public void logSargResult(int stripeIx, boolean[] rgsToRead) {
  ....
  for (int i = 0, valOffset = 0; i < elements; ++i, valOffset += 64) {
    long val = 0;
    for (int j = 0; j < 64; ++j) {
      int ix = valOffset + j;
      if (rgsToRead.length == ix) break;
      if (!rgsToRead[ix]) continue;
      val = val | (1 << j);
      ....
    }
    ....
  }
  ....
}
On the indicated line, the variable j can take values in the range [0..63]. Because of this, the value of val may be computed in the loop not as the developer intended: in the expression (1 << j), the literal 1 is of type int, so shifting it by 32 or more goes beyond the permissible range. To remedy the situation, one should write ((long) 1 << j).
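The difference is easy to demonstrate: an int shift masks j modulo 32, so for j >= 32 the bit lands in the wrong position (or wraps back to bit 0), while a long shift uses the full 0..63 range:

```java
public class LongShiftDemo {
    // Buggy variant: the literal 1 is an int, so j is masked with & 0x1f.
    // For j == 32 this computes 1 << 0 == 1 instead of bit 32.
    static long buggyBit(int j) {
        return 1 << j;
    }

    // Fixed variant: 1L is a long, so the shift distance is masked with
    // & 0x3f and all 64 bit positions are reachable.
    static long fixedBit(int j) {
        return 1L << j;
    }
}
```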
Carried away by logging
V6046 Incorrect format. A different number of format items is expected. Arguments not used: 1, 2. StatsSources.java (89)
private static ImmutableList<PersistedRuntimeStats>
    extractStatsFromPlanMapper(....) {
  ....
  if (stat.size() > 1 || sig.size() > 1) {
    StringBuffer sb = new StringBuffer();
    sb.append(String.format(
        "expected(stat-sig) 1-1, got {}-{} ;",
        ....));
    ....
  }
  ....
}
When formatting the string via String.format(), the developer mixed up the syntax. As a result, the passed parameters never make it into the resulting string. I can only assume that the developer had been working on logging in a previous task and borrowed the '{}' placeholder syntax from there.
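The mix-up is silent: String.format simply ignores surplus arguments when the template contains no format specifiers, so nothing fails at run time. A small sketch of the bug and its fix:

```java
public class FormatDemo {
    // SLF4J-style "{}" placeholders mean nothing to String.format, so the
    // arguments are quietly dropped and the braces appear verbatim.
    static String buggy(int stat, int sig) {
        return String.format("expected(stat-sig) 1-1, got {}-{} ;", stat, sig);
    }

    // String.format needs printf-style specifiers such as %d.
    static String fixed(int stat, int sig) {
        return String.format("expected(stat-sig) 1-1, got %d-%d ;", stat, sig);
    }
}
```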
Stole the exception
V6051 The use of the 'return' statement in the 'finally' block can lead to the loss of unhandled exceptions. ObjectStore.java (9080)
private List<MPartitionColumnStatistics> getMPartitionColumnStatistics(....)
    throws NoSuchObjectException, MetaException {
  boolean committed = false;
  try {
    ....
    committed = commitTransaction();
    return result;
  } catch (Exception ex) {
    LOG.error("Error retrieving statistics via jdo", ex);
    if (ex instanceof MetaException) {
      throw (MetaException) ex;
    }
    throw new MetaException(ex.getMessage());
  } finally {
    if (!committed) {
      rollbackTransaction();
      return Lists.newArrayList();
    }
  }
}
Returning something from a finally block is a very bad practice, and this example shows why.
In the try block, a request is formed and the storage is queried. The committed variable defaults to false and changes its state only after all the actions in the try block complete successfully. This means that if an exception occurs, the variable will always be false. The catch block catches the exception, adjusts it slightly, and rethrows it. But when the turn of the finally block comes, execution enters the condition and returns an empty list to the caller. What does this return cost us? It costs us the fact that every caught exception will never propagate and be handled appropriately. The exceptions listed in the method signature will never be thrown and are simply misleading.
A similar warning:
- V6051 The use of the 'return' statement in the 'finally' block can lead to the loss of unhandled exceptions. ObjectStore.java (808)
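The exception-swallowing effect of a return in finally can be shown in a compact sketch. This is a simplified stand-in for the ObjectStore method (it uses unchecked exceptions and a boolean flag instead of a real transaction, purely for brevity):

```java
import java.util.ArrayList;
import java.util.List;

public class FinallyDemo {
    // Mirrors the flagged pattern: an exception thrown in the catch block is
    // discarded because the finally block completes abruptly with a return.
    static List<String> loadStats(boolean fail) {
        boolean committed = false;
        try {
            if (fail) {
                throw new IllegalStateException("jdo error");
            }
            committed = true;
            List<String> result = new ArrayList<>();
            result.add("row");
            return result;
        } catch (RuntimeException ex) {
            throw new RuntimeException("wrapped: " + ex.getMessage()); // never escapes
        } finally {
            if (!committed) {
                return new ArrayList<>(); // supersedes the pending exception
            }
        }
    }
}
```

Per the Java Language Specification, abrupt completion of a finally block (here, the return) discards any exception in flight, which is exactly why loadStats(true) returns an empty list instead of throwing.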
Other warnings
V6009 Function 'compareTo' receives an odd argument. An object 'o2.getWorkerIdentity()' is used as an argument to its own method. LlapFixedRegistryImpl.java(244)
@Override
public List<LlapServiceInstance> getAllInstancesOrdered(....) {
  ....
  Collections.sort(list, new Comparator<LlapServiceInstance>() {
    @Override
    public int compare(LlapServiceInstance o1, LlapServiceInstance o2) {
      return o2.getWorkerIdentity().compareTo(o2.getWorkerIdentity());
    }
  });
  ....
}
Copy-paste, carelessness, haste: there are many reasons to make this silly mistake. When checking open source projects, errors of this kind come up quite often; there is even a whole article about them.
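The consequence of comparing o2 with itself is that every pair looks equal, so a stable sort leaves the list in its original order. A self-contained sketch over strings (simplified from the LlapServiceInstance comparator):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ComparatorDemo {
    // Buggy comparator: compares o2 with itself, so it always returns 0
    // and the (stable) sort leaves the input order untouched.
    static final Comparator<String> BUGGY = (o1, o2) -> o2.compareTo(o2);

    // Intended comparator: compare the two different elements.
    static final Comparator<String> FIXED = (o1, o2) -> o1.compareTo(o2);

    static List<String> sortedWith(Comparator<String> cmp, String... items) {
        List<String> list = new ArrayList<>(Arrays.asList(items));
        list.sort(cmp);
        return list;
    }
}
```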
V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java (265)
public static long divideUnsignedLong(long dividend, long divisor) {
  if (divisor < 0L) {
    return (compareUnsignedLong(dividend, divisor)) < 0 ? 0L : 1L;
  }
  if (dividend >= 0) {
    ....
  }
  ....
}
Everything is quite simple here: none of the checks guards against division by 0.
More warnings:
- V6020 Mod by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java (309)
- V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java (276)
- V6020 Divide by zero. The range of the 'divisor' denominator values includes zero. SqlMathUtil.java (312)
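One way to close the gap is an explicit guard before the division. This is a hypothetical sketch, not the SqlMathUtil fix; it delegates the actual unsigned division to the standard library:

```java
public class DivideDemo {
    // The original checks handle divisor < 0 and dividend >= 0, but a zero
    // divisor slips through both; an explicit guard makes the failure mode
    // deliberate instead of accidental.
    static long divideUnsignedChecked(long dividend, long divisor) {
        if (divisor == 0L) {
            throw new ArithmeticException("unsigned division by zero");
        }
        return Long.divideUnsigned(dividend, divisor);
    }
}
```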
V6030 The method located to the right of the '|' operator will be called regardless of the value of the left operand. Perhaps, it is better to use '||'. OperatorUtils.java (573)
public static Operator<? extends OperatorDesc> findSourceRS(....) {
  ....
  List<Operator<? extends OperatorDesc>> parents = ....;
  if (parents == null | parents.isEmpty()) {
    ....
  }
  ....
}
Instead of the logical operator ||, the bitwise operator | was written. This means the right-hand side is evaluated regardless of the result of the left-hand side. When parents == null, this typo immediately leads to an NPE in the next logical subexpression.
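The short-circuit difference is easy to demonstrate on a null argument. The sketch below mirrors the null-or-empty check with a String instead of a List, which keeps it self-contained:

```java
public class ShortCircuitDemo {
    // Buggy: '|' evaluates both operands, so s.isEmpty() runs even when
    // s == null and throws a NullPointerException.
    static boolean blankBuggy(String s) {
        return s == null | s.isEmpty();
    }

    // Fixed: '||' short-circuits; the right side is skipped when s == null.
    static boolean blankFixed(String s) {
        return s == null || s.isEmpty();
    }
}
```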
V6042 The expression is checked for compatibility with type 'A' but is cast to type 'B'. VectorColumnAssignFactory.java (347)
public static VectorColumnAssign buildObjectAssign(VectorizedRowBatch outputBatch,
    int outColIndex, PrimitiveCategory category) throws HiveException {
  VectorColumnAssign outVCA = null;
  ColumnVector destCol = outputBatch.cols[outColIndex];
  if (destCol == null) {
    ....
  } else if (destCol instanceof LongColumnVector) {
    switch (category) {
      ....
      case LONG:
        outVCA = new VectorLongColumnAssign() {
          ....
        }.init(...., (LongColumnVector) destCol);
        break;
      case TIMESTAMP:
        outVCA = new VectorTimestampColumnAssign() {
          ....
        }.init(...., (TimestampColumnVector) destCol);
        ....
    }
  }
  ....
}
The classes in question are LongColumnVector extends ColumnVector and TimestampColumnVector extends ColumnVector. The check that destCol is an instance of LongColumnVector tells us clearly that inside the conditional statement we have an object of that class. Nevertheless, it is cast to TimestampColumnVector! As you can see, these are different classes, apart from their common parent. The result is a ClassCastException. The same can be said about the cast to IntervalDayTimeColumnVector:
- V6042 The expression is checked for compatibility with type 'A' but is cast to type 'B'. VectorColumnAssignFactory.java (390)
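The class hierarchy and the failing sibling cast can be reproduced with three stub classes (the stubs stand in for the real Hive vector classes):

```java
public class CastDemo {
    static class ColumnVector {}
    static class LongColumnVector extends ColumnVector {}
    static class TimestampColumnVector extends ColumnVector {}

    // Mirrors the flagged pattern: the instanceof check admits a
    // LongColumnVector, but the body casts to its sibling type, which
    // fails at run time with ClassCastException.
    static boolean castsCleanly(ColumnVector destCol) {
        if (destCol instanceof LongColumnVector) {
            try {
                TimestampColumnVector cast = (TimestampColumnVector) destCol;
                return cast == destCol;   // never reached for a LongColumnVector
            } catch (ClassCastException e) {
                return false;             // sibling cast always lands here
            }
        }
        return true;                      // no cast attempted for other types
    }
}
```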
V6060 The 'var' reference was utilized before it was verified against null. Var.java (402), Var.java (395)
@Override
public boolean equals(Object obj) {
  if (getClass() != obj.getClass()) {
    ....
  }
  ....
}
A strange comparison of the var object with null after dereferencing has already occurred. In this context, var and obj are the same object (var = (Var) obj). The presence of the null check implies that a null object may be passed in. But if equals(null) is called, we get an NPE right on the first line instead of the expected false. Alas, the check is there, but not in the right place.
More suspicious cases where an object is used before the null check:
- V6060 The 'value' reference was utilized before it was verified against null. ParquetRecordReaderWrapper.java (168), ParquetRecordReaderWrapper.java (166)
- V6060 The 'defaultConstraintCols' reference was utilized before it was verified against null. HiveMetaStore.java (2539), HiveMetaStore.java (2530)
- V6060 The 'projIndxLst' reference was utilized before it was verified against null. RelOptHiveTable.java (683), RelOptHiveTable.java (682)
- V6060 The 'oldp' reference was utilized before it was verified against null. ObjectStore.java (4343), ObjectStore.java (4339)
- and so on...
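A sketch of how such an equals can be ordered correctly: reject null before touching the argument, then compare types and fields. The class and its field are invented for illustration; only the check ordering mirrors the Var.java situation:

```java
public class EqualsDemo {
    private final int type;

    EqualsDemo(int type) {
        this.type = type;
    }

    @Override
    public boolean equals(Object obj) {
        // Null check comes first, so equals(null) returns false
        // instead of throwing an NPE on obj.getClass().
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        EqualsDemo var = (EqualsDemo) obj;
        return this.type == var.type;
    }

    @Override
    public int hashCode() {
        return type;
    }
}
```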
Conclusion
Anyone who has taken even a little interest in Big Data can hardly have missed the significance of Apache Hive. The project is popular and quite large, containing more than 6500 source code files (*.java). The code has been written by many developers over many years, and, as a result, the static analyzer has plenty to find. This once again confirms that static analysis is extremely important and useful when developing medium and large projects!
Note. One-time checks like this demonstrate the capabilities of a static code analyzer but are a completely wrong way to use it. This idea is presented in more detail here and here. Use the analyzer regularly!
The check of Hive revealed a fair number of defects and suspicious fragments. If this article catches the attention of the Apache Hive development team, we will be glad to have contributed to this difficult undertaking.
You can't imagine Apache Hive without Apache Hadoop, so the unicorn from PVS-Studio may well pay a visit there too. But that's all for today. In the meantime, download the analyzer and check your own projects.
If you want to share this article with an English-speaking audience, please use the translation link: Maxim Stefanov. PVS-Studio Visits Apache Hive.